action #120744
[alert] QA-Power8-5-kvm: Too many Minion job failures alert size:M (closed)
Added by okurz about 2 years ago. Updated about 2 years ago.
Description
Observation¶
Too many Minion jobs have failed on QA-Power8-5-kvm. Review the failed jobs on http://localhost:9530/minion/jobs?state=failed after tunneling the worker's Minion dashboard via
ssh -L 9530:localhost:9530 -N QA-Power8-5-kvm
Create a ticket if there's not already one. For the general log of the Minion job queue, check out
journalctl -u openqa-worker-cacheservice.service -u openqa-worker-cacheservice-minion.service
To remove all failed jobs on the machine:
/usr/share/openqa/script/openqa-workercache eval 'my $jobs = app->minion->jobs({states => ["failed"]}); while (my $job = $jobs->next) { $job->remove }'
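To merely count the failed jobs before removing them, the same eval helper can be used. This is a sketch, assuming the total method of Minion's job iterator (available in current Minion releases):
/usr/share/openqa/script/openqa-workercache eval 'print app->minion->jobs({states => ["failed"]})->total, "\n"'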
Metric name | Value
Failed      | 101.000
Acceptance criteria¶
- AC1: The cause of the SQLite lock errors is known
Rollback steps¶
- Unpause alert "QA-Power8-5-kvm: Too many Minion job failures alert"
Suggestions¶
- Consider implementing a retry with exponential backoff (see the sketch after this list)
- Exit code 11 is a SEGFAULT, suggesting this is due to a C dependency
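Regarding the first suggestion, a minimal sketch of a retry with exponential backoff around a DBD::SQLite statement could look like this (the helper name, delays and the cache.sqlite path are illustrative assumptions, not the actual openQA implementation):
use strict;
use warnings;
use DBI;
use Time::HiRes 'sleep';

# Retry a statement with exponential backoff while SQLite reports a locked
# database (sketch only; attempt count and delays are arbitrary choices).
sub do_with_backoff {
    my ($dbh, $statement, $attempts) = @_;
    $attempts //= 5;
    my $delay = 0.1;    # seconds
    for my $try (1 .. $attempts) {
        my $ok = eval { $dbh->do($statement); 1 };
        return 1 if $ok;
        die $@ unless $@ =~ /database is locked/;
        die $@ if $try == $attempts;
        sleep $delay;
        $delay *= 2;    # back off exponentially before the next attempt
    }
}

my $dbh = DBI->connect('dbi:SQLite:dbname=/var/lib/openqa/cache/cache.sqlite',
    '', '', {RaiseError => 1, PrintError => 0});
do_with_backoff($dbh, 'pragma journal_mode=DELETE');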
Files
cache-service-sqlite-test.t (593 Bytes), mkittler, 2022-11-25 15:58
Updated by mkittler about 2 years ago
I have removed all but the 10 most recent failed jobs. I have uploaded the database as it was before the removal to https://w3.suse.de/~mkittler/openQA/cache.sqlite in case one wants to have a look. These failures have piled up over the past month and there are basically two different types of them:
DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
- This issue apparently resolved itself whenever it happened (unless somebody released the lock manually). It happened multiple times.
- Jobs currently don't run into it. The last affected job is from 11 hours ago, so there's no need to clean up a stale lock right now. However, I'm wondering why this happens. The machine hasn't been rebooted for 5 days, so it cannot be a symptom of e.g. a crash of the whole system.
Job terminated unexpectedly (exit code: 11, signal: 0)
- There are many of those errors. Not sure what exit code 11 means in this context. Both cache_tests and cache_assets are affected.
- Sometimes it is also Job terminated unexpectedly (exit code: 2, signal: 0).
Note that line 77 is where we set the journaling mode right after establishing the database "connection":
$sqlite->on(
    connection => sub {
        my ($sqlite, $dbh) = @_;
        # default to using DELETE journaling mode to avoid database corruption seen in production (see poo#67000)
        # checkout https://www.sqlite.org/pragma.html#pragma_journal_mode for possible values
        my $sqlite_mode = uc($ENV{OPENQA_CACHE_SERVICE_SQLITE_JOURNAL_MODE} || 'DELETE');
        $dbh->do("pragma journal_mode=$sqlite_mode");    # line 77: the statement failing with "database is locked"
    });
I presume that the DB really is locked by a writer at that time, but instead of waiting until the DB isn't locked anymore the statement fails immediately.
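One way to make the connection wait for the lock instead of failing right away would be DBD::SQLite's busy timeout. A sketch (the 10 s value and the database path are arbitrary assumptions for illustration, not necessarily what an eventual fix would look like):
use strict;
use warnings;
use DBI;

# Let SQLite retry internally for up to 10 seconds before raising
# "database is locked" (sketch; the value is chosen arbitrarily).
my $dbh = DBI->connect('dbi:SQLite:dbname=/var/lib/openqa/cache/cache.sqlite',
    '', '', {RaiseError => 1});
$dbh->sqlite_busy_timeout(10_000);    # milliseconds
$dbh->do('pragma journal_mode=DELETE');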
I found a few tickets that are similar to our problem: https://www.perlmonks.org/?node_id=1073462, https://rt.cpan.org/Public/Bug/Display.html?id=11680, https://stackoverflow.com/questions/41143514/dbdsqlite-database-is-locked-how-to-retry
Before trying to fix the problem it would be nice to come up with some way to reproduce it reliably.
Sometimes the message also shows up in the journal, and sometimes a slow query is logged as well:
Nov 16 18:23:31 QA-Power8-5-kvm openqa-worker-cacheservice-minion[107662]: [107662] [i] Cache size of "/var/lib/openqa/cache" is 49 GiB, with limit 50 GiB
Nov 16 18:23:31 QA-Power8-5-kvm openqa-worker-cacheservice-minion[107662]: [107662] [i] Downloading "SLE-15-SP5-Full-ppc64le-Build29.1-Media1.iso.sha256" from "http://openqa.suse.de/tests/9976147/asset/other/SLE-15-SP5-Full-ppc64le-Build29.1-Media1.iso.sha256"
Nov 16 18:24:06 QA-Power8-5-kvm openqa-worker-cacheservice-minion[107612]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 16 18:24:06 QA-Power8-5-kvm openqa-worker-cacheservice-minion[107612]: at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 28.
Nov 16 18:25:22 QA-Power8-5-kvm openqa-worker-cacheservice-minion[107887]: [107887] [i] Downloading: "upgrade_from-12-SP5-sle-15-SP5-ppc64le-Build29.1-alpha-alpha-node01-HA-BV-offline_dvd-yaml.qcow2"
Nov 16 18:26:29 QA-Power8-5-kvm openqa-worker-cacheservice-minion[107998]: [107998] [i] Downloading: "ha_supportserver_upgrade_sle_12-SP5_ppc64le.qcow2"
Nov 16 18:27:37 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108091]: [108091] [i] Downloading: "ha_supportserver_upgrade_sle_12-SP5_ppc64le_luns.qcow2"
Nov 16 18:28:44 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108404]: [108404] [i] Downloading: "SLE-15-SP5-Full-ppc64le-Build29.1-Media1.iso"
Nov 16 18:29:50 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108482]: [108482] [i] Downloading: "SLES-15-SP1-ppc64le-Installtest-SAP.qcow2"
Nov 16 18:30:16 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108479]: [108479] [i] Downloading: "SLES-15-ppc64le-containers.qcow2"
Nov 16 18:31:32 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108482]: [108482] [i] Cache size of "/var/lib/openqa/cache" is 49 GiB, with limit 50 GiB
Nov 16 18:31:32 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108482]: [108482] [i] Downloading "SLES-15-SP1-ppc64le-Installtest-SAP.qcow2" from "http://openqa.suse.de/tests/9974613/asset/hdd/SLES-15-SP1-ppc64le-Installtest-SAP.qcow2"
Nov 16 18:32:19 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108738]: [108738] [i] Downloading: "SLES-15-ppc64le-Installtest.qcow2"
Nov 16 18:32:57 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108658]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 16 18:32:57 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108658]: at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 28.
Nov 16 18:34:17 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108738]: [108738] [i] Cache size of "/var/lib/openqa/cache" is 49 GiB, with limit 50 GiB
Nov 16 18:34:17 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108738]: [108738] [i] Downloading "SLES-15-ppc64le-Installtest.qcow2" from "http://openqa.suse.de/tests/9974627/asset/hdd/SLES-15-ppc64le-Installtest.qcow2"
Nov 16 18:34:54 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108822]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 16 18:36:22 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108990]: [108990] [i] Downloading: "SLE-15-Installer-DVD-ppc64le-GM-DVD1.iso"
Nov 16 18:36:42 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108990]: [108990] [i] Cache size of "/var/lib/openqa/cache" is 44 GiB, with limit 50 GiB
Nov 16 18:36:42 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108990]: [108990] [i] Downloading "SLE-15-Installer-DVD-ppc64le-GM-DVD1.iso" from "http://openqa.suse.de/tests/9974623/asset/iso/SLE-15-Installer-DVD-ppc64le-GM-DVD1.iso"
Nov 16 18:36:49 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108404]: [108404] [i] Cache size of "/var/lib/openqa/cache" is 49 GiB, with limit 50 GiB
Nov 16 18:36:49 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108404]: [108404] [i] Downloading "SLE-15-SP5-Full-ppc64le-Build29.1-Media1.iso" from "http://openqa.suse.de/tests/9976147/asset/iso/SLE-15-SP5-Full-ppc64le-Build29.1-Media1.iso"
Nov 16 18:38:16 QA-Power8-5-kvm openqa-worker-cacheservice-minion[109289]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 16 18:38:27 QA-Power8-5-kvm openqa-worker-cacheservice-minion[107578]: [107578] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 75051ms
Nov 16 18:40:21 QA-Power8-5-kvm openqa-worker-cacheservice-minion[111111]: [111111] [i] [#17025] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Nov 16 18:40:21 QA-Power8-5-kvm openqa-worker-cacheservice-minion[111111]: [111111] [i] [#17025] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Nov 16 18:51:22 QA-Power8-5-kvm openqa-worker-cacheservice-minion[112229]: [112229] [i] Downloading: "SLES-12-SP5-ppc64le-Installtest-SAP.qcow2"
Nov 16 18:51:59 QA-Power8-5-kvm openqa-worker-cacheservice-minion[112229]: [112229] [i] Cache size of "/var/lib/openqa/cache" is 50 GiB, with limit 50 GiB
Those are all "DB locked" and "Slow SQLite query" messages:
martchus@QA-Power8-5-kvm:~> sudo journalctl -u openqa-worker-cacheservice-minion.service | grep -E 'Slow SQLite query|database is locked'
Okt 31 02:05:59 QA-Power8-5-kvm openqa-worker-cacheservice-minion[63373]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 02 17:49:56 QA-Power8-5-kvm openqa-worker-cacheservice-minion[77805]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 02 18:49:05 QA-Power8-5-kvm openqa-worker-cacheservice-minion[91302]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 05 06:34:34 QA-Power8-5-kvm openqa-worker-cacheservice-minion[109592]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 05 06:42:53 QA-Power8-5-kvm openqa-worker-cacheservice-minion[12232]: [12232] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 88640ms
Nov 05 06:54:04 QA-Power8-5-kvm openqa-worker-cacheservice-minion[12232]: [12232] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 70478ms
Nov 05 06:55:34 QA-Power8-5-kvm openqa-worker-cacheservice-minion[114321]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 06 02:26:43 QA-Power8-5-kvm openqa-worker-cacheservice-minion[12232]: [12232] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 71846ms
Nov 07 16:45:08 QA-Power8-5-kvm openqa-worker-cacheservice-minion[72909]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 09 02:07:39 QA-Power8-5-kvm openqa-worker-cacheservice-minion[21281]: [21281] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 64930ms
Nov 10 16:08:23 QA-Power8-5-kvm openqa-worker-cacheservice-minion[57310]: [57310] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 67910ms
Nov 11 02:25:35 QA-Power8-5-kvm openqa-worker-cacheservice-minion[10630]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 11 12:28:59 QA-Power8-5-kvm openqa-worker-cacheservice-minion[57310]: [57310] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 76625ms
Nov 12 06:29:44 QA-Power8-5-kvm openqa-worker-cacheservice-minion[57310]: [57310] [i] Slow SQLite query: "COMMIT TRANSACTION" -> 90667ms
Nov 12 10:21:38 QA-Power8-5-kvm openqa-worker-cacheservice-minion[50107]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 12 18:26:03 QA-Power8-5-kvm openqa-worker-cacheservice-minion[40380]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 12 20:32:29 QA-Power8-5-kvm openqa-worker-cacheservice-minion[57310]: [57310] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 65761ms
Nov 15 01:23:32 QA-Power8-5-kvm openqa-worker-cacheservice-minion[3097]: [3097] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 83072ms
Nov 16 03:15:51 QA-Power8-5-kvm openqa-worker-cacheservice-minion[57128]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 16 07:09:52 QA-Power8-5-kvm openqa-worker-cacheservice-minion[7062]: [7062] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 69348ms
Nov 16 09:51:59 QA-Power8-5-kvm openqa-worker-cacheservice-minion[84385]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 16 09:55:27 QA-Power8-5-kvm openqa-worker-cacheservice-minion[7062]: [7062] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 84901ms
Nov 16 10:45:14 QA-Power8-5-kvm openqa-worker-cacheservice-minion[7062]: [7062] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 83778ms
Nov 16 12:50:25 QA-Power8-5-kvm openqa-worker-cacheservice-minion[107578]: [107578] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 74640ms
Nov 16 18:24:06 QA-Power8-5-kvm openqa-worker-cacheservice-minion[107612]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 16 18:32:57 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108658]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 16 18:34:54 QA-Power8-5-kvm openqa-worker-cacheservice-minion[108822]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 16 18:38:16 QA-Power8-5-kvm openqa-worker-cacheservice-minion[109289]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Nov 16 18:38:27 QA-Power8-5-kvm openqa-worker-cacheservice-minion[107578]: [107578] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 75051ms
The other errors might be due to a timeout when the cacheservice services are restarted (and thus jobs are aborted uncleanly). The worker has been running for 5 days but the cacheservice services have only been running for 8 hours (so they're restarted more frequently). On the service restart there's nothing very interesting logged, though. At least it doesn't say it had to kill the service:
Nov 18 05:18:22 QA-Power8-5-kvm openqa-worker-cacheservice-minion[83774]: [83774] [i] Downloading: "sle-15-SP3-ppc64le-5.3.18-150300.252.1.g29df9dc-Server-DVD-Incidents-Kernel-KOTD@ppc64le-virtio-with-ltp.qcow2"
Nov 18 05:18:32 QA-Power8-5-kvm openqa-worker-cacheservice-minion[83777]: [83777] [i] Downloading: "SLE-15-SP3-Installer-DVD-ppc64le-GM-DVD1.iso"
Nov 18 05:18:38 QA-Power8-5-kvm openqa-worker-cacheservice-minion[83779]: [83779] [i] [#17406] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Nov 18 05:18:38 QA-Power8-5-kvm openqa-worker-cacheservice-minion[83779]: [83779] [i] [#17406] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Nov 18 06:16:58 QA-Power8-5-kvm systemd[1]: Stopping OpenQA Worker Cache Service Minion...
Nov 18 06:17:01 QA-Power8-5-kvm openqa-worker-cacheservice-minion[35832]: [35832] [i] Worker 35832 stopped
Nov 18 06:17:01 QA-Power8-5-kvm systemd[1]: openqa-worker-cacheservice-minion.service: Main process exited, code=exited, status=216/GROUP
Nov 18 06:17:01 QA-Power8-5-kvm systemd[1]: openqa-worker-cacheservice-minion.service: Failed with result 'exit-code'.
Nov 18 06:17:01 QA-Power8-5-kvm systemd[1]: Stopped OpenQA Worker Cache Service Minion.
Nov 18 06:17:01 QA-Power8-5-kvm systemd[1]: Started OpenQA Worker Cache Service Minion.
Nov 18 09:59:48 QA-Power8-5-kvm openqa-worker-cacheservice-minion[93767]: [93767] [i] Cache size of "/var/lib/openqa/cache" is 49 GiB, with limit 50 GiB
Nov 18 09:59:48 QA-Power8-5-kvm openqa-worker-cacheservice-minion[93767]: [93767] [i] Resetting all leftover locks after restart
Nov 18 09:59:48 QA-Power8-5-kvm openqa-worker-cacheservice-minion[93767]: [93767] [i] Worker 93767 started
Nov 18 09:59:48 QA-Power8-5-kvm openqa-worker-cacheservice-minion[7517]: [7517] [i] Downloading: "SLES-15-SP4-ppc64le-Installtest.qcow2"
Updated by mkittler about 2 years ago
At least DBD::SQLite::db do failed: database is locked is still happening but it still isn't a persistent issue.
There's now also Job terminated unexpectedly (exit code: 0, signal: 15). That's likely just because the worker was terminated via SIGTERM, although it would be nice if that were handled more gracefully in general.
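For illustration only, graceful SIGTERM handling in a long-running worker loop could look roughly like this (a sketch with hypothetical fetch_work/handle_work placeholders, not the actual Minion worker code):
use strict;
use warnings;

# Finish the current work item on SIGTERM and exit cleanly instead of being
# torn down mid-job; fetch_work() and handle_work() are placeholders.
my $stop = 0;
$SIG{TERM} = sub { $stop = 1 };

sub fetch_work  { return {id => int rand 1000} }    # hypothetical work source
sub handle_work { my ($item) = @_; sleep 1 }        # hypothetical processing

while (!$stop) {
    my $item = fetch_work() or next;
    handle_work($item);    # a TERM arriving here only sets the flag
}
exit 0;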
Updated by mkittler about 2 years ago
QA-Power8-4-kvm also has jobs piling up with DBD::SQLite::db do failed: database is locked and Job terminated unexpectedly (exit code: 0, signal: 15), although it seems the frequency isn't as high. On powerqaworker-qam-1.qa.suse.de there are failed jobs as well, but none with DBD::SQLite::db do failed: database is locked (only Job terminated … and Worker went away). I've also checked x86_64/arm workers (worker2, worker3, worker10, openqaworker-arm-{1,2,3}.suse.de) for comparison and haven't found any such failures. However, that maybe doesn't mean much because on those workers we wipe the cache when rebooting and the last reboot was only one day ago. So these are likely still general issues.
Updated by mkittler about 2 years ago
I've checked the x86_64 workers again (which have now been online for 3 days since the last wipe) and couldn't find any failures except Job terminated unexpectedly (exit code: 0, signal: 15).
Since it could be an SQLite problem I've checked which version of perl-DBD-SQLite we're using in production. It is 1.70-lp154.4.9 (or 1.70-lp153.6.2 on power8-4-kvm as it is on Leap 15.3). The package comes from devel:openQA and bundles SQLite (by linking statically against it). This version hasn't been updated for over a year. All of this applies to both the x86_64 and the PowerPC machines. (I find it a bit strange that we have https://build.opensuse.org/package/view_file/devel:openQA:Leap:15.4/perl-DBD-SQLite/perl-DBD-SQLite-use-external-sqlite3.patch?expand=1 in our devel repo but the SQLite.so of the resulting package is nevertheless not linked against the system-provided sqlite library and is significantly bigger instead. So this patch is not used/effective. Note that the perl-DBD-SQLite package provided by Tumbleweed links against the system-provided sqlite library. Considering that this is not what upstream prefers/tests, using their bundled SQLite might be the better option anyway, though.)
EDIT: Ah, the patch is conditionally applied:
# MANUAL BEGIN
%if 0%{?sle_version} >= 140000 && 0%{?sle_version} <= 150400
patch -p1 --reverse <%{PATCH0}
%endif
# MANUAL END
That makes sense, I suppose.
Updated by livdywan about 2 years ago
- Subject changed from [alert] QA-Power8-5-kvm: Too many Minion job failures alert to [alert] QA-Power8-5-kvm: Too many Minion job failures alert size:M
- Description updated (diff)
- Status changed from New to In Progress
Updated by mkittler about 2 years ago
Exit code 11 is a SEGFAULT, suggesting this is due to a C dependency
coredumpctl is installed on the host but there are no dumps. Also note that signal 11 is SIGSEGV but exit code 11 doesn't mean the same thing¹. What it means depends entirely on the application and I don't know what it means for the Minion worker processes. So I'd say it isn't actually crashing here.
¹ I just did a quick test with an app dereferencing NULL. The return code was 139 in that case (and not 11). A quick search also shows that exit code 139 is associated with SIGSEGV.
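For reference, the difference between "exited with code 11" and "terminated by signal 11" can be seen with a small test like this (a sketch; the child simply raises SIGSEGV on itself):
use strict;
use warnings;
use POSIX qw(WIFEXITED WEXITSTATUS WIFSIGNALED WTERMSIG);

# A segfaulting child is reported via its termination signal (11), not via an
# exit code of 11; shells map such a death to status 128 + 11 = 139.
my $pid = fork // die "fork failed: $!";
if ($pid == 0) {
    kill 'SEGV', $$;    # simulate a segfault in the child
    exit 0;             # never reached
}
waitpid $pid, 0;
if    (WIFSIGNALED($?)) { printf "terminated by signal %d\n", WTERMSIG($?) }
elsif (WIFEXITED($?))   { printf "exited with code %d\n",     WEXITSTATUS($?) }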
Updated by openqa_review about 2 years ago
- Due date set to 2022-12-09
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler about 2 years ago
It could really be just an exclusive lock that's held too long. I just did a small test (see attachment for the code) and a second process is indeed not very patient when waiting for another process to release the lock (resulting in the same error we see on the qa-power8 workers).
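The attached test isn't reproduced here, but the contention can be sketched roughly like this with plain DBD::SQLite (a temporary database and a zero busy timeout so the second handle gives up immediately; an illustration, not the attached script):
use strict;
use warnings;
use DBI;
use File::Temp 'tempfile';

# One handle keeps a write transaction open while a second handle (with no
# busy timeout) tries to write and fails immediately with "database is locked".
my ($fh, $db) = tempfile(SUFFIX => '.sqlite');
my $writer = DBI->connect("dbi:SQLite:dbname=$db", '', '', {RaiseError => 1});
my $victim = DBI->connect("dbi:SQLite:dbname=$db", '', '', {RaiseError => 1, PrintError => 0});
$victim->sqlite_busy_timeout(0);    # do not wait for the lock at all

$writer->do('create table t (id integer)');
$writer->begin_work;
$writer->do('insert into t values (1)');    # write lock is now held

my $ok = eval { $victim->do('insert into t values (2)'); 1 };
print $ok ? "unexpectedly succeeded\n" : "failed as expected: $@";
$writer->commit;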
Updated by mkittler about 2 years ago
- Status changed from In Progress to Feedback
Updated by mkittler about 2 years ago
The PR has been merged and deployed on 2022-11-29T05:59Z and the last occurrence of the database is locked issue was on 2022-11-29T02:01:44Z. I'll re-check tomorrow and if it still looks good I'll resolve the issue.
(The other failures I've found don't seem to happen only on power workers and they're happening with a much lower frequency, so I'm not considering them within this ticket.)
Updated by mkittler about 2 years ago
- Status changed from Feedback to Resolved
The last failed Minion job is still the one mentioned in my last comment. So I've cleaned up the jobs that had piled up on all power workers before the fix was deployed. With that I consider this ticket resolved.