action #101033
closed
openqaworker13: Too many Minion job failures alert - sqlite failed: database is locked size:M
Description
Observation
The alert mentioned in the title triggered today. Looking at the worker shows 103 failed Minion jobs, all in "cache_assets" and "cache_tests" with a message similar to:
---
args:
- 7425737
- hdd
- carwos-x86-avicenzi_carwos2-230527.qcow2
- openqa.suse.de
attempts: 1
children: []
created: 2021-10-15T08:59:48Z
delayed: 2021-10-15T08:59:48Z
expires: ~
finished: 2021-10-15T09:00:45Z
id: 29755
lax: 0
notes:
  lock: carwos-x86-avicenzi_carwos2-230527.qcow2.openqa.suse.de
parents: []
priority: 0
queue: default
result: |
  DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 71.
retried: ~
retries: 0
started: 2021-10-15T08:59:53Z
state: failed
task: cache_asset
time: 2021-10-15T09:19:06Z
worker: 23
Acceptance Criteria
AC1: Understand why this issue appears and how we could avoid it in the future
Suggestions
- @kraih seems to know more. Maybe talk to him about possible solutions
- Not visible in the test itself: https://openqa.suse.de/tests/7425737
Workaround
Manually retrigger the failed jobs if found and no actual issue shows up
Updated by okurz about 3 years ago
- Priority changed from Normal to High
- Target version set to Ready
Updated by livdywan about 3 years ago
- Subject changed from openqaworker13: Too many Minion job failures alert - sqlite failed: database is locked to openqaworker13: Too many Minion job failures alert - sqlite failed: database is locked size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by kraih about 3 years ago
- Assignee set to kraih
I'll take a look since I'm already familiar with the code.
Updated by ggardet_arm about 3 years ago
- Related to action #101373: worker openqa-aarch64 fails on cache added
Updated by kraih about 3 years ago
Judging by #101373 it seemed likely that we are also dealing with slow disk I/O here, but currently it doesn't look too bad on this machine:
$ sudo hdparm --direct -Tt /dev/sda2
/dev/sda2:
Timing O_DIRECT cached reads: 494 MB in 2.00 seconds = 246.78 MB/sec
Timing O_DIRECT disk reads: 646 MB in 3.00 seconds = 215.08 MB/sec
$ dd if=/dev/zero of=test.file bs=64M count=16 oflag=dsync
16+0 records in
16+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.7405 s, 123 MB/s
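For reference, the dd measurement above can be approximated with a short self-contained script. This is only a rough stand-in: an fsync() after each block instead of per-write O_DSYNC, and the sizes here are illustrative:

```python
import os
import tempfile
import time

def synced_write_throughput(total_mb=64, block_mb=8):
    """Write total_mb of zeros, fsync()ing after each block, and return MB/s.

    Roughly approximates what `dd oflag=dsync` measures: each block must
    reach stable storage before the next write starts.
    """
    block = b"\0" * (block_mb * 1024 * 1024)
    fd, path = tempfile.mkstemp()
    try:
        start = time.monotonic()
        for _ in range(total_mb // block_mb):
            os.write(fd, block)
            os.fsync(fd)  # force the block to stable storage before continuing
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return total_mb / elapsed

print(f"{synced_write_throughput():.1f} MB/s")
```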
Updated by openqa_review about 3 years ago
- Due date set to 2021-11-09
Setting due date based on mean cycle time of SUSE QE Tools
Updated by kraih about 3 years ago
In the journal there was one more entry for this issue. Unfortunately, the surrounding log messages do not contain any hints about the cause:
Oct 23 09:32:47 openqaworker13 openqa-worker-cacheservice-minion[1210]: [7486] [i] [#10925] Downloading "....qcow2" from "http://....qcow2"
Oct 23 09:35:01 openqaworker13 openqa-worker-cacheservice-minion[1210]: [8545] [i] [#10927] Downloading: "....qcow2"
Oct 23 09:35:42 openqaworker13 openqa-worker-cacheservice-minion[1210]: [8545] [i] [#10927] Cache size of "/var/lib/openqa/cache" is 46 GiB, with limit 50 GiB
Oct 23 09:35:42 openqaworker13 openqa-worker-cacheservice-minion[1210]: [8545] [i] [#10927] Downloading "....qcow2" from "http://....qcow2"
Oct 23 09:36:17 openqaworker13 openqa-worker-cacheservice-minion[1210]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 71.
Oct 23 09:36:17 openqaworker13 openqa-worker-cacheservice-minion[1210]: at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 28.
Oct 23 09:37:16 openqaworker13 openqa-worker-cacheservice-minion[1210]: [9224] [i] [#10930] Downloading: "SLE-12-..."
Oct 23 09:37:17 openqaworker13 openqa-worker-cacheservice-minion[1210]: [9256] [i] [#10931] Downloading: "SLE-15-..."
Updated by kraih about 3 years ago
Checked the Grafana stats around Oct 23 09:36:17
and everything appears perfectly normal. No CPU, Memory or Disk I/O spikes.
Updated by kraih about 3 years ago
Opened a PR to find out exactly which of our queries are so slow. https://github.com/os-autoinst/openQA/pull/4338
Updated by okurz about 3 years ago
- Description updated (diff)
PR merged, so we can await feedback from production. I suggest keeping it on "High" as long as the cases need manual handling.
Updated by kraih about 3 years ago
- Status changed from In Progress to Feedback
Waiting for more data.
Updated by kraih about 3 years ago
We have already caught a few cases of slow queries on the machine, and the first one seems relevant to this issue:
Oct 30 02:20:34 openqaworker13 openqa-worker-cacheservice-minion[1797]: [12928] [i] [#9201] Downloading: "sle-15....qcow2"
Oct 30 02:21:41 openqaworker13 openqa-worker-cacheservice-minion[1797]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Oct 30 02:21:41 openqaworker13 openqa-worker-cacheservice-minion[1797]: at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 28.
Oct 30 02:25:59 openqaworker13 openqa-worker-cacheservice-minion[1797]: [12928] [i] Slow SQLite query: "SELECT "filename", "etag", "last_use", "size" FROM "assets" WHERE "filename" = ?" -> 66812ms
Oct 30 02:25:59 openqaworker13 openqa-worker-cacheservice-minion[1797]: [12928] [i] [#9201] Cache size of "/var/lib/openqa/cache" is 43 GiB, with limit 50 GiB
Oct 30 09:02:38 openqaworker13 openqa-worker-cacheservice-minion[1797]: [22191] [i] [#9564] Cache size of "/var/lib/openqa/cache" is 50 GiB, with limit 50 GiB
Oct 30 09:02:38 openqaworker13 openqa-worker-cacheservice-minion[1797]: [22191] [i] [#9564] Downloading "sle-15....qcow2" from "http://openqa.suse.de/tests/7571788/asset/hdd/sle-15....qcow2"
Oct 30 09:03:37 openqaworker13 openqa-worker-cacheservice-minion[1797]: [22727] [i] [#9568] Downloading: "SLE-15-SP...-Installer...-DVD1.iso"
Oct 30 09:04:44 openqaworker13 openqa-worker-cacheservice-minion[1797]: [23151] [i] [#9569] Downloading: "SLE-15-SP...-Installer...-DVD1.iso"
Oct 30 09:05:28 openqaworker13 openqa-worker-cacheservice-minion[1797]: [1797] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 65518ms
Oct 30 09:05:43 openqaworker13 openqa-worker-cacheservice-minion[1797]: [23768] [i] [#9570] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Oct 30 09:05:43 openqaworker13 openqa-worker-cacheservice-minion[1797]: [23768] [i] [#9570] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Nov 01 23:11:28 openqaworker13 openqa-worker-cacheservice-minion[17543]: [9753] [i] [#12566] Downloading: "SLE-15...-x86_64-Build58.2-Media1.iso.sha256"
Nov 01 23:11:56 openqaworker13 openqa-worker-cacheservice-minion[17543]: [9914] [i] [#12567] Downloading: "SLE-15...-x86_64-Build58.2-Media1.iso.sha256"
Nov 01 23:13:02 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 67948ms
Nov 01 23:13:02 openqaworker13 openqa-worker-cacheservice-minion[17543]: [10337] [i] [#12568] Downloading: "sle-15...-node01.qcow2"
Nov 01 23:13:07 openqaworker13 openqa-worker-cacheservice-minion[17543]: [10341] [i] [#12569] Downloading: "sle-15...-node02.qcow2"
While the cases look quite different, and only one actually caused the "database is locked" error, the query times are so close together that it seems too odd to be a coincidence.
Updated by kraih about 3 years ago
The current busy timeout we use is 10 minutes, so a query would have to be blocked for that long to get a "database is locked" error. The only slow queries we've seen so far were around 66 seconds, which means it is probably not one of our queries that blocks SQLite here.
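The lock-contention mechanism itself is easy to reproduce in a self-contained sketch (Python's sqlite3 standing in for the Perl code, and the busy timeout shortened from 10 minutes to 100 ms so the demo finishes quickly):

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file: a "writer" that grabs and keeps
# the write lock, and a "victim" whose busy timeout is far shorter than the
# 10 minutes the cache service uses.
path = os.path.join(tempfile.mkdtemp(), "cache.sqlite")
writer = sqlite3.connect(path, isolation_level=None)
victim = sqlite3.connect(path, isolation_level=None)
victim.execute("PRAGMA busy_timeout = 100")  # 100 ms for the demo

writer.execute("CREATE TABLE assets (filename TEXT)")
writer.execute("BEGIN IMMEDIATE TRANSACTION")  # take the write lock, keep it open

caught = None
try:
    victim.execute("BEGIN IMMEDIATE TRANSACTION")  # waits 100 ms, then gives up
except sqlite3.OperationalError as e:
    caught = str(e)
    print(caught)  # database is locked
writer.execute("ROLLBACK")
```

Only when the lock is held longer than the full busy timeout does the waiting side fail with "database is locked"; until then it just shows up as a slow query.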
Updated by kraih about 3 years ago
A few more cases:
Nov 02 15:29:43 openqaworker13 openqa-worker-cacheservice-minion[17543]: [20824] [i] [#13411] Downloading: "sle-15-....qcow2"
Nov 02 15:30:05 openqaworker13 openqa-worker-cacheservice-minion[17543]: [20879] [i] [#13412] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.s>
Nov 02 15:30:05 openqaworker13 openqa-worker-cacheservice-minion[17543]: [20879] [i] [#13412] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --del>
Nov 02 15:32:36 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 63728ms
Nov 02 15:32:37 openqaworker13 openqa-worker-cacheservice-minion[17543]: [21889] [i] [#13413] Downloading: "SLE-15-...-Media1.iso.sha256"
Nov 02 15:40:04 openqaworker13 openqa-worker-cacheservice-minion[17543]: [24030] [i] [#13424] Downloading: "sle-15....qcow2"
Nov 02 15:42:40 openqaworker13 openqa-worker-cacheservice-minion[17543]: [24718] [i] [#13426] Downloading: "SLE-15-...-Media1.iso"
Nov 02 15:42:44 openqaworker13 openqa-worker-cacheservice-minion[17543]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Cach...
Nov 02 15:43:59 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 66977ms
Nov 02 15:44:26 openqaworker13 openqa-worker-cacheservice-minion[17543]: [25237] [i] [#13427] Downloading: "SLE-15-...-Media1.iso"
Nov 02 16:21:43 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11072] [i] [#13456] Downloading: "SLE-12-...-Media1.iso"
Nov 02 16:21:47 openqaworker13 openqa-worker-cacheservice-minion[17543]: [10107] [i] [#13454] Cache size of "/var/lib/openqa/cache" is 47 GiB, with limit 50 GiB
Nov 02 16:21:47 openqaworker13 openqa-worker-cacheservice-minion[17543]: [10107] [i] [#13454] Downloading "SLE....qc...
Nov 02 16:23:09 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11072] [i] [#13456] Cache size of "/var/lib/openqa/cache" is 42 GiB, with limit 50 GiB
Nov 02 16:23:09 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11072] [i] [#13456] Downloading "SLE-....iso" from "http:>
Nov 02 16:24:12 openqaworker13 openqa-worker-cacheservice-minion[17543]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Cach
Nov 02 16:25:21 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11472] [i] Slow SQLite query: "select count(*) from minion_jobs as j
Nov 02 16:25:21 openqaworker13 openqa-worker-cacheservice-minion[17543]: where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now'))" -> 94961ms
Nov 02 16:25:21 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11472] [i] [#13458] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.s>
Nov 02 16:25:21 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11472] [i] [#13458] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --del>
Nov 02 16:26:56 openqaworker13 openqa-worker-cacheservice-minion[17543]: [7075] [i] [#13451] Cache size of "/var/lib/openqa/cache" is 47 GiB, with limit 50 GiB
Nov 02 16:26:56 openqaworker13 openqa-worker-cacheservice-minion[17543]: [7075] [i] [#13451] Downloading "SLE-....iso" from "http://o>
Nov 02 16:27:09 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 68067ms
Nov 02 16:27:09 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 65283ms
Nov 02 16:27:38 openqaworker13 openqa-worker-cacheservice-minion[17543]: [12190] [i] [#13459] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.s>
Nov 02 16:27:38 openqaworker13 openqa-worker-cacheservice-minion[17543]: [12190] [i] [#13459] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --del>
Nov 02 16:31:14 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 148561ms
Nov 02 16:31:14 openqaworker13 openqa-worker-cacheservice-minion[17543]: [13647] [i] [#13460] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.s>
Nov 02 16:31:14 openqaworker13 openqa-worker-cacheservice-minion[17543]: [13647] [i] [#13460] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --del>
Nov 02 16:33:04 openqaworker13 openqa-worker-cacheservice-minion[17543]: [14424] [i] [#13463] Downloading: "openSUSE-Leap-....iso"
Nov 04 08:42:06 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22356] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 69671ms
Nov 04 08:42:54 openqaworker13 openqa-worker-cacheservice-minion[22356]: [19962] [i] [#15787] Downloading: "SLE-15....iso"
Nov 04 08:44:30 openqaworker13 openqa-worker-cacheservice-minion[22356]: [19050] [i] Slow SQLite query: "SELECT "filename", "etag", "last_use", "size" FROM "assets" WHERE "filename" = ?" -> 67921ms
Nov 04 08:44:30 openqaworker13 openqa-worker-cacheservice-minion[22356]: [19050] [i] [#15785] Cache size of "/var/lib/openqa/cache" is 50 GiB, with limit 50 GiB
Nov 04 08:44:30 openqaworker13 openqa-worker-cacheservice-minion[22356]: [19050] [i] [#15785] Downloading "SUSE-MicroOS....raw.xz" from "http://openqa.suse.de/tests/7604213/asset/hdd/SUSE-MicroOS...>
Nov 04 08:46:45 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22356] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 68109ms
Nov 04 08:46:45 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22356] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 85983ms
Nov 04 08:47:50 openqaworker13 openqa-worker-cacheservice-minion[22356]: [18955] [i] [#15784] Cache size of "/var/lib/openqa/cache" is 50 GiB, with limit 50 GiB
Nov 04 08:47:50 openqaworker13 openqa-worker-cacheservice-minion[22356]: [18955] [i] [#15784] Downloading "SLE-12-....iso" from "http://openqa.suse.de/tests/.../asset/iso/SLE-12-...1>
Nov 04 08:48:09 openqaworker13 openqa-worker-cacheservice-minion[22356]: [20970] [i] [#15788] Downloading: "SLE-15-....iso.sha256"
Nov 04 08:49:05 openqaworker13 openqa-worker-cacheservice-minion[22356]: [21647] [i] [#15789] Downloading: "SLES-15-....qcow2"
Nov 04 08:49:43 openqaworker13 openqa-worker-cacheservice-minion[22356]: [20970] [i] Slow SQLite query: "SELECT "filename", "etag", "last_use", "size" FROM "assets" WHERE "filename" = ?" -> 75424ms
Nov 04 08:49:43 openqaworker13 openqa-worker-cacheservice-minion[22356]: [20970] [i] [#15788] Cache size of "/var/lib/openqa/cache" is 48 GiB, with limit 50 GiB
Nov 04 08:49:43 openqaworker13 openqa-worker-cacheservice-minion[22356]: [20970] [i] [#15788] Downloading "SLE-15-....iso.sha256" from "http://openqa.suse.de/tests/.../asset/other/...>
Nov 04 08:50:07 openqaworker13 openqa-worker-cacheservice-minion[22356]: [18650] [i] [#15783] Cache size of "/var/lib/openqa/cache" is 50 GiB, with limit 50 GiB
Nov 04 08:50:07 openqaworker13 openqa-worker-cacheservice-minion[22356]: [18650] [i] [#15783] Downloading "SLE-15-....iso" from "http://openqa.suse.de/tests/7603975/asset/iso/...>
Nov 04 08:52:21 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22356] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 88407ms
Nov 04 08:52:21 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22474] [i] [#15792] Downloading: "openSUSE-Leap-....iso"
Nov 04 08:52:21 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22048] [i] Slow SQLite query: "select count(*) from minion_jobs as j
Nov 04 08:52:21 openqaworker13 openqa-worker-cacheservice-minion[22356]: where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now'))" -> 89355ms
Nov 05 03:23:39 openqaworker13 openqa-worker-cacheservice-minion[22356]: [26949] [i] Slow SQLite query: "select count(*) from minion_jobs as j
Nov 05 03:23:39 openqaworker13 openqa-worker-cacheservice-minion[22356]: where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now'))" -> 83326ms
It really happens a lot now on the machine, and I'm starting to see a few patterns emerge. Almost all slow queries actually come from Minion::Backend::SQLite, not our cache service model, and BEGIN IMMEDIATE TRANSACTION originates from the Minion manager process. That leads me to the conclusion that this is an upstream performance issue in Minion::Backend::SQLite and can probably be fixed by profiling its queries to identify the bottlenecks.
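The slow-query logging that produced the journal entries above lives in the Perl cache service; a minimal self-contained sketch of the same idea in Python's sqlite3 (the threshold and names are illustrative, not the production values):

```python
import sqlite3
import time

SLOW_MS = 50   # illustrative logging threshold
slow_log = []  # collected (sql, elapsed_ms) records

class TimingConnection(sqlite3.Connection):
    """Log every statement slower than SLOW_MS, similar in spirit to the
    slow-query logging added to the cache service."""

    def execute(self, sql, *args):
        start = time.monotonic()
        try:
            return super().execute(sql, *args)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            if elapsed_ms > SLOW_MS:
                slow_log.append((sql, elapsed_ms))
                print(f'Slow SQLite query: "{sql}" -> {elapsed_ms:.0f}ms')

db = sqlite3.connect(":memory:", factory=TimingConnection)
db.execute("SELECT 1")  # fast, not logged

# A deliberately expensive query to trigger the logger:
db.execute("""WITH RECURSIVE c(i) AS
              (SELECT 1 UNION ALL SELECT i + 1 FROM c WHERE i < 3000000)
              SELECT count(*) FROM c""")
```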
Updated by kraih about 3 years ago
Also looked through the journals of most of our other OSD workers and didn't find any slow query cases yet. That seems a bit odd.
Updated by kraih about 3 years ago
Forgot to mention: thanks to the slow query logging we now know that it is the cache service's REST API that issues the slow queries blocking the Minion jobs.
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: [26681] [i] Slow SQLite query: "select count(case when state = 'inactive' and (expires is null or expires > datetime('now'))
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: then 1 end) as inactive_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: count(case state when 'active' then 1 end) as active_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: count(case state when 'failed' then 1 end) as failed_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: count(case state when 'finished' then 1 end) as finished_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: count(case when state = 'inactive' and delayed > datetime('now')
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: then 1 end) as delayed_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: (select count(*) from minion_locks where expires > datetime('now'))
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: as active_locks,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: count(distinct case when state = 'active' then worker end)
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: as active_workers,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: ifnull((select seq from sqlite_sequence where name = 'minion_jobs'), 0)
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: as enqueued_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: (select count(*) from minion_workers) as inactive_workers, null as uptime
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: from minion_jobs" -> 230715ms
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: [26681] [i] Slow SQLite query: "select id, args, attempts,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: (select json_group_array(distinct child.id)
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: from minion_jobs as child, json_each(child.parents) as parent_id
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: where j.id = parent_id.value) as children,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: strftime('%s',created) as created, strftime('%s',delayed) as delayed,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: strftime('%s',expires) as expires, strftime('%s',finished) as finished,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: lax, notes, parents, priority, queue, result,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: strftime('%s',retried) as retried, retries,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: strftime('%s',started) as started, state, task,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: strftime('%s','now') as time, worker
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: from minion_jobs as j
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now')) order by id desc limit ? offset ?" -> 96147ms
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: [26681] [i] Slow SQLite query: "select id, args, attempts,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: (select json_group_array(distinct child.id)
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: from minion_jobs as child, json_each(child.parents) as parent_id
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: where j.id = parent_id.value) as children,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: strftime('%s',created) as created, strftime('%s',delayed) as delayed,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: strftime('%s',expires) as expires, strftime('%s',finished) as finished,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: lax, notes, parents, priority, queue, result,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: strftime('%s',retried) as retried, retries,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: strftime('%s',started) as started, state, task,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: strftime('%s','now') as time, worker
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: from minion_jobs as j
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now')) order by id desc limit ? offset ?" -> 69435ms
These occur at a very high frequency.
Updated by kraih about 3 years ago
However, since an automatic restart of the machine on Sunday the problem appears to have vanished. It does not look like anything significant was updated that day, just some libvirt packages. In our Grafana metrics the only noteworthy change is that the load average of the machine went down significantly since the restart: it no longer goes above 10, where it could previously reach 200+.
Updated by kraih about 3 years ago
In almost all cases the responsible Minion::Backend::SQLite methods are list_jobs and stats. That might be interesting if this problem appears again, because stats should only be called for monitoring, which means we could reduce the frequency of telegraf requests to the REST API service to mitigate the problem.
Updated by kraih about 3 years ago
And the queries are unoptimised; not even stats uses an index.
sqlite> explain query plan select count(case when state = 'inactive' and (expires is null or expires > datetime('now')) then 1 end) as inactive_jobs, count(case state when 'active' then 1 end) as active_jobs, count(case state when 'failed' then 1 end) as failed_jobs, count(case state when 'finished' then 1 end) as finished_jobs, count(case when state = 'inactive' and delayed > datetime('now') then 1 end) as delayed_jobs, (select count(*) from minion_locks where expires > datetime('now')) as active_locks, count(distinct case when state = 'active' then worker end) as active_workers, ifnull((select seq from sqlite_sequence where name = 'minion_jobs'), 0) as enqueued_jobs, (select count(*) from minion_workers) as inactive_workers, null as uptime from minion_jobs;
QUERY PLAN
|--USE TEMP B-TREE FOR count(DISTINCT)
|--SCAN minion_jobs
|--SCALAR SUBQUERY 1
| `--SCAN minion_locks
|--SCALAR SUBQUERY 2
| `--SCAN sqlite_sequence
`--SCALAR SUBQUERY 3
`--SCAN minion_workers
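For illustration, a heavily cut-down, hypothetical version of the jobs table shows how adding an index on state turns such a full scan into an index search (Python's sqlite3 standing in for the sqlite shell; the real Minion schema and queries are far more involved):

```python
import sqlite3

# Hypothetical, minimal stand-in for the Minion jobs table.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE minion_jobs (
                  id INTEGER PRIMARY KEY AUTOINCREMENT,
                  state TEXT NOT NULL,
                  worker INTEGER)""")

query = "SELECT count(*) FROM minion_jobs WHERE state = 'failed'"

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return [row[3] for row in db.execute("EXPLAIN QUERY PLAN " + sql)]

before = plan(query)  # full table scan of minion_jobs
db.execute("CREATE INDEX minion_jobs_state ON minion_jobs (state)")
after = plan(query)   # search using the new index
print(before)
print(after)
```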
Updated by kraih about 3 years ago
Reported the performance bottleneck upstream too. https://github.com/Grinnz/Minion-Backend-SQLite/issues/18
Updated by livdywan about 3 years ago
- Due date changed from 2021-11-09 to 2021-11-12
I'm assuming this wasn't resolved yesterday, hence bumping the due date ;-)
Nice find on the upstream issue. Are you planning to fix it upstream then?
Updated by kraih about 3 years ago
Lowering priority because the machine has been fine since Sunday.
Updated by kraih about 3 years ago
cdywan wrote:
Nice find on the upstream issue. Are you planning to fix it upstream then?
I've opened a PR for that. https://github.com/Grinnz/Minion-Backend-SQLite/pull/19
Updated by kraih about 3 years ago
- File worker13.png worker13.png added
Looking at the load average over the past week, it seems the machine is back to normal.
Updated by kraih about 3 years ago
- Status changed from Feedback to Resolved
No changes over the weekend, still everything normal. I guess we can consider this issue resolved. The upstream patch for Minion::Backend::SQLite has also been merged and should be in production soon.