action #101033

openqaworker13: Too many Minion job failures alert - sqlite failed: database is locked size:M

Added by nicksinger about 2 months ago. Updated 15 days ago.

Status:
Resolved
Priority:
Low
Assignee:
Target version:
Start date:
2021-10-15
Due date:
2021-11-12
% Done:

0%

Estimated time:

Description

Observation

The alert mentioned in the title triggered today. Looking at the worker shows 103 failed Minion jobs, all in "cache_assets" and "cache_tests", with a message similar to:

---
args:
- 7425737
- hdd
- carwos-x86-avicenzi_carwos2-230527.qcow2
- openqa.suse.de
attempts: 1
children: []
created: 2021-10-15T08:59:48Z
delayed: 2021-10-15T08:59:48Z
expires: ~
finished: 2021-10-15T09:00:45Z
id: 29755
lax: 0
notes:
  lock: carwos-x86-avicenzi_carwos2-230527.qcow2.openqa.suse.de
parents: []
priority: 0
queue: default
result: |
  DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 71.
retried: ~
retries: 0
started: 2021-10-15T08:59:53Z
state: failed
task: cache_asset
time: 2021-10-15T09:19:06Z
worker: 23

Acceptance Criteria

AC1: Understand why this issue appears and how we could avoid it in the future

Suggestions

Workaround

Manually retrigger failed jobs if found and no underlying issue shows up

worker13.png (146 KB) worker13.png kraih, 2021-11-12 14:44

Related issues

Related to openQA Infrastructure - action #101373: worker openqa-aarch64 fails on cache (Resolved, 2021-10-22)

History

#1 Updated by nicksinger about 2 months ago

  • Description updated (diff)

#2 Updated by okurz about 1 month ago

  • Priority changed from Normal to High
  • Target version set to Ready

#3 Updated by cdywan about 1 month ago

  • Subject changed from openqaworker13: Too many Minion job failures alert - sqlite failed: database is locked to openqaworker13: Too many Minion job failures alert - sqlite failed: database is locked size:M
  • Description updated (diff)
  • Status changed from New to Workable

#4 Updated by kraih about 1 month ago

  • Assignee set to kraih

I'll take a look since I'm already familiar with the code.

#5 Updated by ggardet_arm about 1 month ago

  • Related to action #101373: worker openqa-aarch64 fails on cache added

#6 Updated by kraih about 1 month ago

  • Status changed from Workable to In Progress

#7 Updated by kraih about 1 month ago

Judging by #101373 it seemed likely that we are also dealing with slow disk I/O here, but currently it doesn't look too bad on this machine.

$ sudo hdparm --direct -Tt /dev/sda2
/dev/sda2:
 Timing O_DIRECT cached reads:   494 MB in  2.00 seconds = 246.78 MB/sec
 Timing O_DIRECT disk reads: 646 MB in  3.00 seconds = 215.08 MB/sec
$ dd if=/dev/zero of=test.file bs=64M count=16 oflag=dsync
16+0 records in
16+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.7405 s, 123 MB/s

#8 Updated by openqa_review about 1 month ago

  • Due date set to 2021-11-09

Setting due date based on mean cycle time of SUSE QE Tools

#9 Updated by kraih about 1 month ago

In the journal there was one more entry for this issue. Unfortunately the surrounding log messages do not contain any hints about the problem.

Oct 23 09:32:47 openqaworker13 openqa-worker-cacheservice-minion[1210]: [7486] [i] [#10925] Downloading "....qcow2" from "http://....qcow2"
Oct 23 09:35:01 openqaworker13 openqa-worker-cacheservice-minion[1210]: [8545] [i] [#10927] Downloading: "....qcow2"
Oct 23 09:35:42 openqaworker13 openqa-worker-cacheservice-minion[1210]: [8545] [i] [#10927] Cache size of "/var/lib/openqa/cache" is 46 GiB, with limit 50 GiB
Oct 23 09:35:42 openqaworker13 openqa-worker-cacheservice-minion[1210]: [8545] [i] [#10927] Downloading "....qcow2" from "http://....qcow2"
Oct 23 09:36:17 openqaworker13 openqa-worker-cacheservice-minion[1210]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 71.
Oct 23 09:36:17 openqaworker13 openqa-worker-cacheservice-minion[1210]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 28.
Oct 23 09:37:16 openqaworker13 openqa-worker-cacheservice-minion[1210]: [9224] [i] [#10930] Downloading: "SLE-12-..."
Oct 23 09:37:17 openqaworker13 openqa-worker-cacheservice-minion[1210]: [9256] [i] [#10931] Downloading: "SLE-15-..."

#10 Updated by kraih about 1 month ago

Checked the Grafana stats around Oct 23 09:36:17 and everything appears perfectly normal: no CPU, memory, or disk I/O spikes.

#11 Updated by kraih about 1 month ago

Opened a PR to find out exactly which of our queries are so slow. https://github.com/os-autoinst/openQA/pull/4338

#12 Updated by okurz about 1 month ago

  • Description updated (diff)

PR merged, so we can await feedback from production. I suggest keeping it on "High" as long as we need manual handling of these cases.

#13 Updated by kraih about 1 month ago

  • Status changed from In Progress to Feedback

Waiting for more data.

#14 Updated by kraih 28 days ago

We have already caught a few cases of slow queries on the machine. Of these, the first one seems relevant to this issue:

Oct 30 02:20:34 openqaworker13 openqa-worker-cacheservice-minion[1797]: [12928] [i] [#9201] Downloading: "sle-15....qcow2"
Oct 30 02:21:41 openqaworker13 openqa-worker-cacheservice-minion[1797]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Oct 30 02:21:41 openqaworker13 openqa-worker-cacheservice-minion[1797]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 28.
Oct 30 02:25:59 openqaworker13 openqa-worker-cacheservice-minion[1797]: [12928] [i] Slow SQLite query: "SELECT "filename", "etag", "last_use", "size" FROM "assets" WHERE "filename" = ?" -> 66812ms
Oct 30 02:25:59 openqaworker13 openqa-worker-cacheservice-minion[1797]: [12928] [i] [#9201] Cache size of "/var/lib/openqa/cache" is 43 GiB, with limit 50 GiB
Oct 30 09:02:38 openqaworker13 openqa-worker-cacheservice-minion[1797]: [22191] [i] [#9564] Cache size of "/var/lib/openqa/cache" is 50 GiB, with limit 50 GiB
Oct 30 09:02:38 openqaworker13 openqa-worker-cacheservice-minion[1797]: [22191] [i] [#9564] Downloading "sle-15....qcow2" from "http://openqa.suse.de/tests/7571788/asset/hdd/sle-15....qcow2"
Oct 30 09:03:37 openqaworker13 openqa-worker-cacheservice-minion[1797]: [22727] [i] [#9568] Downloading: "SLE-15-SP...-Installer...-DVD1.iso"
Oct 30 09:04:44 openqaworker13 openqa-worker-cacheservice-minion[1797]: [23151] [i] [#9569] Downloading: "SLE-15-SP...-Installer...-DVD1.iso"
Oct 30 09:05:28 openqaworker13 openqa-worker-cacheservice-minion[1797]: [1797] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 65518ms
Oct 30 09:05:43 openqaworker13 openqa-worker-cacheservice-minion[1797]: [23768] [i] [#9570] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Oct 30 09:05:43 openqaworker13 openqa-worker-cacheservice-minion[1797]: [23768] [i] [#9570] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Nov 01 23:11:28 openqaworker13 openqa-worker-cacheservice-minion[17543]: [9753] [i] [#12566] Downloading: "SLE-15...-x86_64-Build58.2-Media1.iso.sha256"
Nov 01 23:11:56 openqaworker13 openqa-worker-cacheservice-minion[17543]: [9914] [i] [#12567] Downloading: "SLE-15...-x86_64-Build58.2-Media1.iso.sha256"
Nov 01 23:13:02 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 67948ms
Nov 01 23:13:02 openqaworker13 openqa-worker-cacheservice-minion[17543]: [10337] [i] [#12568] Downloading: "sle-15...-node01.qcow2"
Nov 01 23:13:07 openqaworker13 openqa-worker-cacheservice-minion[17543]: [10341] [i] [#12569] Downloading: "sle-15...-node02.qcow2"

While the cases seem very different, and only one actually caused the "database is locked" error, the fact that the query times are so close together seems too odd to be a coincidence.

#15 Updated by kraih 27 days ago

The current busy timeout we use is 10 minutes, so a query would have to be blocked for that long to get a "database is locked" error. The only slow queries we've seen so far were around 66 seconds. That means it is probably not one of our own queries that blocks SQLite here.
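The interaction between the busy timeout and this error can be sketched with a small, self-contained example. This is Python with the built-in sqlite3 module rather than the Perl code the cache service actually uses, and the table name and the 100 ms timeout are purely illustrative: a second connection attempting a write only fails with "database is locked" once its busy timeout expires while another connection still holds the write lock.

```python
import os
import sqlite3
import tempfile

# One connection holds the write lock while another tries to write with a
# short busy timeout. SQLite raises "database is locked" only after the
# busy timeout expires without the lock becoming free.
db = os.path.join(tempfile.mkdtemp(), "cache.sqlite")

writer = sqlite3.connect(db, isolation_level=None)
writer.execute("CREATE TABLE assets (filename TEXT)")
writer.execute("BEGIN IMMEDIATE")  # take the write lock and keep it

blocked = sqlite3.connect(db, isolation_level=None)
blocked.execute("PRAGMA busy_timeout = 100")  # 100 ms instead of 10 minutes

err = None
try:
    blocked.execute("INSERT INTO assets (filename) VALUES ('test.qcow2')")
except sqlite3.OperationalError as e:
    err = e
writer.execute("ROLLBACK")

print(err)  # database is locked
```

With the production timeout of 10 minutes in place of the 100 ms here, something would have to hold the lock for a full 10 minutes before this error could fire.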

#16 Updated by kraih 25 days ago

A few more.

Nov 02 15:29:43 openqaworker13 openqa-worker-cacheservice-minion[17543]: [20824] [i] [#13411] Downloading: "sle-15-....qcow2"
Nov 02 15:30:05 openqaworker13 openqa-worker-cacheservice-minion[17543]: [20879] [i] [#13412] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.s>
Nov 02 15:30:05 openqaworker13 openqa-worker-cacheservice-minion[17543]: [20879] [i] [#13412] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --del>
Nov 02 15:32:36 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 63728ms
Nov 02 15:32:37 openqaworker13 openqa-worker-cacheservice-minion[17543]: [21889] [i] [#13413] Downloading: "SLE-15-...-Media1.iso.sha256"
Nov 02 15:40:04 openqaworker13 openqa-worker-cacheservice-minion[17543]: [24030] [i] [#13424] Downloading: "sle-15....qcow2"
Nov 02 15:42:40 openqaworker13 openqa-worker-cacheservice-minion[17543]: [24718] [i] [#13426] Downloading: "SLE-15-...-Media1.iso"
Nov 02 15:42:44 openqaworker13 openqa-worker-cacheservice-minion[17543]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Cach...
Nov 02 15:43:59 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 66977ms
Nov 02 15:44:26 openqaworker13 openqa-worker-cacheservice-minion[17543]: [25237] [i] [#13427] Downloading: "SLE-15-...-Media1.iso"
Nov 02 16:21:43 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11072] [i] [#13456] Downloading: "SLE-12-...-Media1.iso"
Nov 02 16:21:47 openqaworker13 openqa-worker-cacheservice-minion[17543]: [10107] [i] [#13454] Cache size of "/var/lib/openqa/cache" is 47 GiB, with limit 50 GiB
Nov 02 16:21:47 openqaworker13 openqa-worker-cacheservice-minion[17543]: [10107] [i] [#13454] Downloading "SLE....qc...
Nov 02 16:23:09 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11072] [i] [#13456] Cache size of "/var/lib/openqa/cache" is 42 GiB, with limit 50 GiB
Nov 02 16:23:09 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11072] [i] [#13456] Downloading "SLE-....iso" from "http:>
Nov 02 16:24:12 openqaworker13 openqa-worker-cacheservice-minion[17543]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Cach
Nov 02 16:25:21 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11472] [i] Slow SQLite query: "select count(*) from minion_jobs as j
Nov 02 16:25:21 openqaworker13 openqa-worker-cacheservice-minion[17543]:     where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now'))" -> 94961ms
Nov 02 16:25:21 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11472] [i] [#13458] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.s>
Nov 02 16:25:21 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11472] [i] [#13458] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --del>
Nov 02 16:26:56 openqaworker13 openqa-worker-cacheservice-minion[17543]: [7075] [i] [#13451] Cache size of "/var/lib/openqa/cache" is 47 GiB, with limit 50 GiB
Nov 02 16:26:56 openqaworker13 openqa-worker-cacheservice-minion[17543]: [7075] [i] [#13451] Downloading "SLE-....iso" from "http://o>
Nov 02 16:27:09 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 68067ms
Nov 02 16:27:09 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 65283ms
Nov 02 16:27:38 openqaworker13 openqa-worker-cacheservice-minion[17543]: [12190] [i] [#13459] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.s>
Nov 02 16:27:38 openqaworker13 openqa-worker-cacheservice-minion[17543]: [12190] [i] [#13459] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --del>
Nov 02 16:31:14 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 148561ms
Nov 02 16:31:14 openqaworker13 openqa-worker-cacheservice-minion[17543]: [13647] [i] [#13460] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.s>
Nov 02 16:31:14 openqaworker13 openqa-worker-cacheservice-minion[17543]: [13647] [i] [#13460] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --del>
Nov 02 16:33:04 openqaworker13 openqa-worker-cacheservice-minion[17543]: [14424] [i] [#13463] Downloading: "openSUSE-Leap-....iso"
Nov 04 08:42:06 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22356] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 69671ms
Nov 04 08:42:54 openqaworker13 openqa-worker-cacheservice-minion[22356]: [19962] [i] [#15787] Downloading: "SLE-15....iso"
Nov 04 08:44:30 openqaworker13 openqa-worker-cacheservice-minion[22356]: [19050] [i] Slow SQLite query: "SELECT "filename", "etag", "last_use", "size" FROM "assets" WHERE "filename" = ?" -> 67921ms
Nov 04 08:44:30 openqaworker13 openqa-worker-cacheservice-minion[22356]: [19050] [i] [#15785] Cache size of "/var/lib/openqa/cache" is 50 GiB, with limit 50 GiB
Nov 04 08:44:30 openqaworker13 openqa-worker-cacheservice-minion[22356]: [19050] [i] [#15785] Downloading "SUSE-MicroOS....raw.xz" from "http://openqa.suse.de/tests/7604213/asset/hdd/SUSE-MicroOS...>
Nov 04 08:46:45 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22356] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 68109ms
Nov 04 08:46:45 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22356] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 85983ms
Nov 04 08:47:50 openqaworker13 openqa-worker-cacheservice-minion[22356]: [18955] [i] [#15784] Cache size of "/var/lib/openqa/cache" is 50 GiB, with limit 50 GiB
Nov 04 08:47:50 openqaworker13 openqa-worker-cacheservice-minion[22356]: [18955] [i] [#15784] Downloading "SLE-12-....iso" from "http://openqa.suse.de/tests/.../asset/iso/SLE-12-...1>
Nov 04 08:48:09 openqaworker13 openqa-worker-cacheservice-minion[22356]: [20970] [i] [#15788] Downloading: "SLE-15-....iso.sha256"
Nov 04 08:49:05 openqaworker13 openqa-worker-cacheservice-minion[22356]: [21647] [i] [#15789] Downloading: "SLES-15-....qcow2"
Nov 04 08:49:43 openqaworker13 openqa-worker-cacheservice-minion[22356]: [20970] [i] Slow SQLite query: "SELECT "filename", "etag", "last_use", "size" FROM "assets" WHERE "filename" = ?" -> 75424ms
Nov 04 08:49:43 openqaworker13 openqa-worker-cacheservice-minion[22356]: [20970] [i] [#15788] Cache size of "/var/lib/openqa/cache" is 48 GiB, with limit 50 GiB
Nov 04 08:49:43 openqaworker13 openqa-worker-cacheservice-minion[22356]: [20970] [i] [#15788] Downloading "SLE-15-....iso.sha256" from "http://openqa.suse.de/tests/.../asset/other/...>
Nov 04 08:50:07 openqaworker13 openqa-worker-cacheservice-minion[22356]: [18650] [i] [#15783] Cache size of "/var/lib/openqa/cache" is 50 GiB, with limit 50 GiB
Nov 04 08:50:07 openqaworker13 openqa-worker-cacheservice-minion[22356]: [18650] [i] [#15783] Downloading "SLE-15-....iso" from "http://openqa.suse.de/tests/7603975/asset/iso/...>
Nov 04 08:52:21 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22356] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 88407ms
Nov 04 08:52:21 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22474] [i] [#15792] Downloading: "openSUSE-Leap-....iso"
Nov 04 08:52:21 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22048] [i] Slow SQLite query: "select count(*) from minion_jobs as j
Nov 04 08:52:21 openqaworker13 openqa-worker-cacheservice-minion[22356]:     where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now'))" -> 89355ms
Nov 05 03:23:39 openqaworker13 openqa-worker-cacheservice-minion[22356]: [26949] [i] Slow SQLite query: "select count(*) from minion_jobs as j
Nov 05 03:23:39 openqaworker13 openqa-worker-cacheservice-minion[22356]:     where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now'))" -> 83326ms

It really happens a lot now on the machine, and I'm starting to see a few patterns emerge. Almost all slow queries actually come from Minion::Backend::SQLite, not our cache service model. And "BEGIN IMMEDIATE TRANSACTION" originates from the Minion manager process. That leads me to the conclusion that this is an upstream performance issue in Minion::Backend::SQLite, which can probably be fixed by profiling its queries to identify the bottlenecks.
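Why a plain "BEGIN IMMEDIATE TRANSACTION" can show up as a slow query at all is worth spelling out: the statement itself does no work, it merely acquires the write lock, so its runtime is essentially time spent waiting for another writer. A minimal sketch of this, again in Python/sqlite3 rather than Minion's Perl backend, with a 500 ms lock holder standing in for whatever blocks the database in production:

```python
import os
import sqlite3
import tempfile
import threading
import time

db = os.path.join(tempfile.mkdtemp(), "minion.sqlite")

# check_same_thread=False so a helper thread may commit this connection.
holder = sqlite3.connect(db, isolation_level=None, check_same_thread=False)
holder.execute("CREATE TABLE minion_jobs (id INTEGER PRIMARY KEY, state TEXT)")
holder.execute("BEGIN IMMEDIATE")  # simulate a long-running writer

def release_later():
    time.sleep(0.5)                # the "slow" writer finishes after 500 ms
    holder.execute("COMMIT")

t = threading.Thread(target=release_later)
t.start()

waiter = sqlite3.connect(db, isolation_level=None)
waiter.execute("PRAGMA busy_timeout = 600000")  # generous, like the 10 min one
start = time.monotonic()
waiter.execute("BEGIN IMMEDIATE")  # blocks until the holder commits
elapsed = time.monotonic() - start
waiter.execute("ROLLBACK")
t.join()

print(f"BEGIN IMMEDIATE took {elapsed:.2f}s")  # roughly the holder's lock time
```

So a logged "BEGIN IMMEDIATE TRANSACTION -> 66977ms" points at whoever held the lock for those 66 seconds, not at the transaction being opened.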

#17 Updated by kraih 25 days ago

  • Status changed from Feedback to In Progress

#18 Updated by kraih 25 days ago

Also looked through the journals of most of our other OSD workers and didn't find any slow query cases yet. That seems a bit odd.

#19 Updated by kraih 21 days ago

Forgot to mention: thanks to the slow query logging, we now know that it is the cache service's REST API that issues the slow queries blocking the Minion jobs.

Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: [26681] [i] Slow SQLite query: "select count(case when state = 'inactive' and (expires is null or expires > datetime('now'))
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:         then 1 end) as inactive_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       count(case state when 'active' then 1 end) as active_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       count(case state when 'failed' then 1 end) as failed_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       count(case state when 'finished' then 1 end) as finished_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       count(case when state = 'inactive' and delayed > datetime('now')
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:         then 1 end) as delayed_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       (select count(*) from minion_locks where expires > datetime('now'))
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:         as active_locks,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       count(distinct case when state = 'active' then worker end)
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:         as active_workers,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       ifnull((select seq from sqlite_sequence where name = 'minion_jobs'), 0)
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:         as enqueued_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       (select count(*) from minion_workers) as inactive_workers, null as uptime
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       from minion_jobs" -> 230715ms
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: [26681] [i] Slow SQLite query: "select id, args, attempts,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        (select json_group_array(distinct child.id)
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:          from minion_jobs as child, json_each(child.parents) as parent_id
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:          where j.id = parent_id.value) as children,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',created) as created, strftime('%s',delayed) as delayed,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',expires) as expires, strftime('%s',finished) as finished,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        lax, notes, parents, priority, queue, result,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',retried) as retried, retries,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',started) as started, state, task,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s','now') as time, worker
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        from minion_jobs as j
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now')) order by id desc limit ? offset ?" -> 96147ms
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: [26681] [i] Slow SQLite query: "select id, args, attempts,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        (select json_group_array(distinct child.id)
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:          from minion_jobs as child, json_each(child.parents) as parent_id
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:          where j.id = parent_id.value) as children,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',created) as created, strftime('%s',delayed) as delayed,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',expires) as expires, strftime('%s',finished) as finished,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        lax, notes, parents, priority, queue, result,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',retried) as retried, retries,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',started) as started, state, task,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s','now') as time, worker
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        from minion_jobs as j
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now')) order by id desc limit ? offset ?" -> 69435ms

The frequency of these is very high.

#20 Updated by kraih 21 days ago

However, since an automatic restart of the machine on Sunday the problem appears to have vanished. It does not look like anything significant was updated that day, just some libvirt packages. In our Grafana metrics the only noteworthy change is that the load average of the machine went down significantly since the restart, no longer going above 10 where it could previously reach 200+.

#21 Updated by kraih 21 days ago

In almost all cases the responsible Minion::Backend::SQLite methods are list_jobs and stats. This might be interesting if the problem appears again, because stats should only be called for monitoring. That means we could reduce the frequency of telegraf requests to the REST API service to mitigate the problem.

#22 Updated by kraih 21 days ago

And the queries are unoptimised; not even stats uses an index.

sqlite> explain query plan select count(case when state = 'inactive' and (expires is null or expires > datetime('now')) then 1 end) as inactive_jobs, count(case state when 'active' then 1 end) as active_jobs, count(case state when 'failed' then 1 end) as failed_jobs, count(case state when 'finished' then 1 end) as finished_jobs, count(case when state = 'inactive' and delayed > datetime('now') then 1 end) as delayed_jobs, (select count(*) from minion_locks where expires > datetime('now')) as active_locks, count(distinct case when state = 'active' then worker end) as active_workers, ifnull((select seq from sqlite_sequence where name = 'minion_jobs'), 0) as enqueued_jobs, (select count(*) from minion_workers) as inactive_workers, null as uptime from minion_jobs;
QUERY PLAN
|--USE TEMP B-TREE FOR count(DISTINCT)
|--SCAN minion_jobs
|--SCALAR SUBQUERY 1
|  `--SCAN minion_locks
|--SCALAR SUBQUERY 2
|  `--SCAN sqlite_sequence
`--SCALAR SUBQUERY 3
   `--SCAN minion_workers
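The effect of the missing index can be reproduced with a small sketch (Python/sqlite3; the stripped-down minion_jobs table, the simplified count-by-state query, and the index on "state" are illustrative assumptions, not necessarily the fix that landed upstream):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE minion_jobs (id INTEGER PRIMARY KEY, state TEXT)")

def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN output is the "detail" text.
    return " | ".join(row[3] for row in db.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT count(*) FROM minion_jobs WHERE state = 'failed'"

p1 = plan(query)
print(p1)  # a full table scan, e.g. "SCAN minion_jobs"

db.execute("CREATE INDEX minion_jobs_state ON minion_jobs (state)")
p2 = plan(query)
print(p2)  # now served from the index on "state"
```

On a large minion_jobs table the difference between a full scan and an index lookup is exactly the kind of multi-second gap the slow query log has been showing.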

#23 Updated by kraih 21 days ago

Reported the performance bottleneck upstream too. https://github.com/Grinnz/Minion-Backend-SQLite/issues/18

#24 Updated by cdywan 20 days ago

  • Due date changed from 2021-11-09 to 2021-11-12

I'm assuming this wasn't resolved yesterday, hence bumping the due date ;-)

Nice find on the upstream issue. Are you planning to fix it upstream then?

#25 Updated by kraih 20 days ago

  • Priority changed from High to Low

#26 Updated by kraih 20 days ago

Lowering priority because the machine has been fine since Sunday.

#27 Updated by kraih 18 days ago

cdywan wrote:

Nice find on the upstream issue. Are you planning to fix it upstream then?

I've opened a PR for that. https://github.com/Grinnz/Minion-Backend-SQLite/pull/19

#28 Updated by kraih 18 days ago


Looking at the load average over the past week, it seems the machine is back to normal.

#29 Updated by kraih 18 days ago

  • Status changed from In Progress to Feedback

#30 Updated by kraih 15 days ago

  • Status changed from Feedback to Resolved

No changes over the weekend, still everything normal, so I guess we can consider this issue resolved. The upstream patch for Minion::Backend::SQLite has also been merged and should be in production soon.
