Project

General

Profile

Actions

action #101033

closed

openqaworker13: Too many Minion job failures alert - sqlite failed: database is locked size:M

Added by nicksinger over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Low
Assignee:
Category:
-
Target version:
Start date:
2021-10-15
Due date:
2021-11-12
% Done:

0%

Estimated time:

Description

Observation

the alert mentioned in the title triggered today. Looking at the worker shows me 103 failed minion jobs. All in "cache_assets" and "cache_tests" with a message similar to:

---
args:
- 7425737
- hdd
- carwos-x86-avicenzi_carwos2-230527.qcow2
- openqa.suse.de
attempts: 1
children: []
created: 2021-10-15T08:59:48Z
delayed: 2021-10-15T08:59:48Z
expires: ~
finished: 2021-10-15T09:00:45Z
id: 29755
lax: 0
notes:
  lock: carwos-x86-avicenzi_carwos2-230527.qcow2.openqa.suse.de
parents: []
priority: 0
queue: default
result: |
  DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 71.
retried: ~
retries: 0
started: 2021-10-15T08:59:53Z
state: failed
task: cache_asset
time: 2021-10-15T09:19:06Z
worker: 23

Acceptance Criteria

AC1: Understand, why this issue appears and how we could avoid it in the future

Suggestions

Workaround

Manually retrigger if found and no issue shows up


Files

worker13.png (146 KB) worker13.png kraih, 2021-11-12 14:44

Related issues 1 (0 open1 closed)

Related to openQA Infrastructure - action #101373: worker openqa-aarch64 fails on cacheResolvedmkittler2021-10-22

Actions
Actions #1

Updated by nicksinger over 2 years ago

  • Description updated (diff)
Actions #2

Updated by okurz over 2 years ago

  • Priority changed from Normal to High
  • Target version set to Ready
Actions #3

Updated by livdywan over 2 years ago

  • Subject changed from openqaworker13: Too many Minion job failures alert - sqlite failed: database is locked to openqaworker13: Too many Minion job failures alert - sqlite failed: database is locked size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by kraih over 2 years ago

  • Assignee set to kraih

I'll take a look since i'm already familiar with the code.

Actions #5

Updated by ggardet_arm over 2 years ago

  • Related to action #101373: worker openqa-aarch64 fails on cache added
Actions #6

Updated by kraih over 2 years ago

  • Status changed from Workable to In Progress
Actions #7

Updated by kraih over 2 years ago

Judging by #101373 it seemed likely that we are also dealing with slow disk i/o here. But currently it doesn't look too bad on this machine.

$ sudo hdparm --direct -Tt /dev/sda2
/dev/sda2:
 Timing O_DIRECT cached reads:   494 MB in  2.00 seconds = 246.78 MB/sec
 Timing O_DIRECT disk reads: 646 MB in  3.00 seconds = 215.08 MB/sec
$ dd if=/dev/zero of=test.file bs=64M count=16 oflag=dsync
16+0 records in
16+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.7405 s, 123 MB/s
Actions #8

Updated by openqa_review over 2 years ago

  • Due date set to 2021-11-09

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by kraih over 2 years ago

In the journal there was one more entry for this issue. Unfortunately the surrounding log messages do not contain any hints at the problem.

Oct 23 09:32:47 openqaworker13 openqa-worker-cacheservice-minion[1210]: [7486] [i] [#10925] Downloading "....qcow2" from "http://....qcow2"
Oct 23 09:35:01 openqaworker13 openqa-worker-cacheservice-minion[1210]: [8545] [i] [#10927] Downloading: "....qcow2"
Oct 23 09:35:42 openqaworker13 openqa-worker-cacheservice-minion[1210]: [8545] [i] [#10927] Cache size of "/var/lib/openqa/cache" is 46 GiB, with limit 50 GiB
Oct 23 09:35:42 openqaworker13 openqa-worker-cacheservice-minion[1210]: [8545] [i] [#10927] Downloading "....qcow2" from "http://....qcow2"
Oct 23 09:36:17 openqaworker13 openqa-worker-cacheservice-minion[1210]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 71.
Oct 23 09:36:17 openqaworker13 openqa-worker-cacheservice-minion[1210]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 28.
Oct 23 09:37:16 openqaworker13 openqa-worker-cacheservice-minion[1210]: [9224] [i] [#10930] Downloading: "SLE-12-..."
Oct 23 09:37:17 openqaworker13 openqa-worker-cacheservice-minion[1210]: [9256] [i] [#10931] Downloading: "SLE-15-..."
Actions #10

Updated by kraih over 2 years ago

Checked the Grafana stats around Oct 23 09:36:17 and everything appears perfectly normal. No CPU, Memory or Disk I/O spikes.

Actions #11

Updated by kraih over 2 years ago

Opened a PR to find out exactly which of our queries are so slow. https://github.com/os-autoinst/openQA/pull/4338

Actions #12

Updated by okurz over 2 years ago

  • Description updated (diff)

PR merged. So we can await feedback from production. I suggest to keep it on "High" as long as we need manual handling of the cases.

Actions #13

Updated by kraih over 2 years ago

  • Status changed from In Progress to Feedback

Waiting for more data.

Actions #14

Updated by kraih over 2 years ago

We have caught a few cases of slow queries on the machine already. And of these the first one seems relevant to this issue:

Oct 30 02:20:34 openqaworker13 openqa-worker-cacheservice-minion[1797]: [12928] [i] [#9201] Downloading: "sle-15....qcow2"
Oct 30 02:21:41 openqaworker13 openqa-worker-cacheservice-minion[1797]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/CacheService.pm line 77.
Oct 30 02:21:41 openqaworker13 openqa-worker-cacheservice-minion[1797]:  at /usr/lib/perl5/vendor_perl/5.26.1/Minion/Command/minion/worker.pm line 28.
Oct 30 02:25:59 openqaworker13 openqa-worker-cacheservice-minion[1797]: [12928] [i] Slow SQLite query: "SELECT "filename", "etag", "last_use", "size" FROM "assets" WHERE "filename" = ?" -> 66812ms
Oct 30 02:25:59 openqaworker13 openqa-worker-cacheservice-minion[1797]: [12928] [i] [#9201] Cache size of "/var/lib/openqa/cache" is 43 GiB, with limit 50 GiB
Oct 30 09:02:38 openqaworker13 openqa-worker-cacheservice-minion[1797]: [22191] [i] [#9564] Cache size of "/var/lib/openqa/cache" is 50 GiB, with limit 50 GiB
Oct 30 09:02:38 openqaworker13 openqa-worker-cacheservice-minion[1797]: [22191] [i] [#9564] Downloading "sle-15....qcow2" from "http://openqa.suse.de/tests/7571788/asset/hdd/sle-15....qcow2"
Oct 30 09:03:37 openqaworker13 openqa-worker-cacheservice-minion[1797]: [22727] [i] [#9568] Downloading: "SLE-15-SP...-Installer...-DVD1.iso"
Oct 30 09:04:44 openqaworker13 openqa-worker-cacheservice-minion[1797]: [23151] [i] [#9569] Downloading: "SLE-15-SP...-Installer...-DVD1.iso"
Oct 30 09:05:28 openqaworker13 openqa-worker-cacheservice-minion[1797]: [1797] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 65518ms
Oct 30 09:05:43 openqaworker13 openqa-worker-cacheservice-minion[1797]: [23768] [i] [#9570] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.suse.de"
Oct 30 09:05:43 openqaworker13 openqa-worker-cacheservice-minion[1797]: [23768] [i] [#9570] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --delete /var/lib/openqa/cache/openqa.suse.de/tests/
Nov 01 23:11:28 openqaworker13 openqa-worker-cacheservice-minion[17543]: [9753] [i] [#12566] Downloading: "SLE-15...-x86_64-Build58.2-Media1.iso.sha256"
Nov 01 23:11:56 openqaworker13 openqa-worker-cacheservice-minion[17543]: [9914] [i] [#12567] Downloading: "SLE-15...-x86_64-Build58.2-Media1.iso.sha256"
Nov 01 23:13:02 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 67948ms
Nov 01 23:13:02 openqaworker13 openqa-worker-cacheservice-minion[17543]: [10337] [i] [#12568] Downloading: "sle-15...-node01.qcow2"
Nov 01 23:13:07 openqaworker13 openqa-worker-cacheservice-minion[17543]: [10341] [i] [#12569] Downloading: "sle-15...-node02.qcow2"

While the cases seem very different, and only one actually caused the database is locked error, the fact that the query times are so close together seems too odd to be a coincidence.

Actions #15

Updated by kraih over 2 years ago

The current busy timeout we use is 10 minutes, so a query would have to be blocked for that long to get a database is locked error. And the only slow queries we've seen so far were around 66 seconds. That means it is probably not one of our queries that blocks SQLite here.

Actions #16

Updated by kraih over 2 years ago

A few more.

Nov 02 15:29:43 openqaworker13 openqa-worker-cacheservice-minion[17543]: [20824] [i] [#13411] Downloading: "sle-15-....qcow2"
Nov 02 15:30:05 openqaworker13 openqa-worker-cacheservice-minion[17543]: [20879] [i] [#13412] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.s>
Nov 02 15:30:05 openqaworker13 openqa-worker-cacheservice-minion[17543]: [20879] [i] [#13412] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --del>
Nov 02 15:32:36 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 63728ms
Nov 02 15:32:37 openqaworker13 openqa-worker-cacheservice-minion[17543]: [21889] [i] [#13413] Downloading: "SLE-15-...-Media1.iso.sha256"
Nov 02 15:40:04 openqaworker13 openqa-worker-cacheservice-minion[17543]: [24030] [i] [#13424] Downloading: "sle-15....qcow2"
Nov 02 15:42:40 openqaworker13 openqa-worker-cacheservice-minion[17543]: [24718] [i] [#13426] Downloading: "SLE-15-...-Media1.iso"
Nov 02 15:42:44 openqaworker13 openqa-worker-cacheservice-minion[17543]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Cach...
Nov 02 15:43:59 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 66977ms
Nov 02 15:44:26 openqaworker13 openqa-worker-cacheservice-minion[17543]: [25237] [i] [#13427] Downloading: "SLE-15-...-Media1.iso"
Nov 02 16:21:43 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11072] [i] [#13456] Downloading: "SLE-12-...-Media1.iso"
Nov 02 16:21:47 openqaworker13 openqa-worker-cacheservice-minion[17543]: [10107] [i] [#13454] Cache size of "/var/lib/openqa/cache" is 47 GiB, with limit 50 GiB
Nov 02 16:21:47 openqaworker13 openqa-worker-cacheservice-minion[17543]: [10107] [i] [#13454] Downloading "SLE....qc...
Nov 02 16:23:09 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11072] [i] [#13456] Cache size of "/var/lib/openqa/cache" is 42 GiB, with limit 50 GiB
Nov 02 16:23:09 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11072] [i] [#13456] Downloading "SLE-....iso" from "http:>
Nov 02 16:24:12 openqaworker13 openqa-worker-cacheservice-minion[17543]: DBD::SQLite::db do failed: database is locked at /usr/share/openqa/script/../lib/OpenQA/Cach
Nov 02 16:25:21 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11472] [i] Slow SQLite query: "select count(*) from minion_jobs as j
Nov 02 16:25:21 openqaworker13 openqa-worker-cacheservice-minion[17543]:     where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now'))" -> 94961ms
Nov 02 16:25:21 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11472] [i] [#13458] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.s>
Nov 02 16:25:21 openqaworker13 openqa-worker-cacheservice-minion[17543]: [11472] [i] [#13458] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --del>
Nov 02 16:26:56 openqaworker13 openqa-worker-cacheservice-minion[17543]: [7075] [i] [#13451] Cache size of "/var/lib/openqa/cache" is 47 GiB, with limit 50 GiB
Nov 02 16:26:56 openqaworker13 openqa-worker-cacheservice-minion[17543]: [7075] [i] [#13451] Downloading "SLE-....iso" from "http://o>
Nov 02 16:27:09 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 68067ms
Nov 02 16:27:09 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 65283ms
Nov 02 16:27:38 openqaworker13 openqa-worker-cacheservice-minion[17543]: [12190] [i] [#13459] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.s>
Nov 02 16:27:38 openqaworker13 openqa-worker-cacheservice-minion[17543]: [12190] [i] [#13459] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --del>
Nov 02 16:31:14 openqaworker13 openqa-worker-cacheservice-minion[17543]: [17543] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 148561ms
Nov 02 16:31:14 openqaworker13 openqa-worker-cacheservice-minion[17543]: [13647] [i] [#13460] Sync: "rsync://openqa.suse.de/tests" to "/var/lib/openqa/cache/openqa.s>
Nov 02 16:31:14 openqaworker13 openqa-worker-cacheservice-minion[17543]: [13647] [i] [#13460] Calling: rsync -avHP --timeout 1800 rsync://openqa.suse.de/tests/ --del>
Nov 02 16:33:04 openqaworker13 openqa-worker-cacheservice-minion[17543]: [14424] [i] [#13463] Downloading: "openSUSE-Leap-....iso"
Nov 04 08:42:06 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22356] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 69671ms
Nov 04 08:42:54 openqaworker13 openqa-worker-cacheservice-minion[22356]: [19962] [i] [#15787] Downloading: "SLE-15....iso"
Nov 04 08:44:30 openqaworker13 openqa-worker-cacheservice-minion[22356]: [19050] [i] Slow SQLite query: "SELECT "filename", "etag", "last_use", "size" FROM "assets" WHERE "filename" = ?" -> 67921ms
Nov 04 08:44:30 openqaworker13 openqa-worker-cacheservice-minion[22356]: [19050] [i] [#15785] Cache size of "/var/lib/openqa/cache" is 50 GiB, with limit 50 GiB
Nov 04 08:44:30 openqaworker13 openqa-worker-cacheservice-minion[22356]: [19050] [i] [#15785] Downloading "SUSE-MicroOS....raw.xz" from "http://openqa.suse.de/tests/7604213/asset/hdd/SUSE-MicroOS...>
Nov 04 08:46:45 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22356] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 68109ms
Nov 04 08:46:45 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22356] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 85983ms
Nov 04 08:47:50 openqaworker13 openqa-worker-cacheservice-minion[22356]: [18955] [i] [#15784] Cache size of "/var/lib/openqa/cache" is 50 GiB, with limit 50 GiB
Nov 04 08:47:50 openqaworker13 openqa-worker-cacheservice-minion[22356]: [18955] [i] [#15784] Downloading "SLE-12-....iso" from "http://openqa.suse.de/tests/.../asset/iso/SLE-12-...1>
Nov 04 08:48:09 openqaworker13 openqa-worker-cacheservice-minion[22356]: [20970] [i] [#15788] Downloading: "SLE-15-....iso.sha256"
Nov 04 08:49:05 openqaworker13 openqa-worker-cacheservice-minion[22356]: [21647] [i] [#15789] Downloading: "SLES-15-....qcow2"
Nov 04 08:49:43 openqaworker13 openqa-worker-cacheservice-minion[22356]: [20970] [i] Slow SQLite query: "SELECT "filename", "etag", "last_use", "size" FROM "assets" WHERE "filename" = ?" -> 75424ms
Nov 04 08:49:43 openqaworker13 openqa-worker-cacheservice-minion[22356]: [20970] [i] [#15788] Cache size of "/var/lib/openqa/cache" is 48 GiB, with limit 50 GiB
Nov 04 08:49:43 openqaworker13 openqa-worker-cacheservice-minion[22356]: [20970] [i] [#15788] Downloading "SLE-15-....iso.sha256" from "http://openqa.suse.de/tests/.../asset/other/...>
Nov 04 08:50:07 openqaworker13 openqa-worker-cacheservice-minion[22356]: [18650] [i] [#15783] Cache size of "/var/lib/openqa/cache" is 50 GiB, with limit 50 GiB
Nov 04 08:50:07 openqaworker13 openqa-worker-cacheservice-minion[22356]: [18650] [i] [#15783] Downloading "SLE-15-....iso" from "http://openqa.suse.de/tests/7603975/asset/iso/...>
Nov 04 08:52:21 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22356] [i] Slow SQLite query: "BEGIN IMMEDIATE TRANSACTION" -> 88407ms
Nov 04 08:52:21 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22474] [i] [#15792] Downloading: "openSUSE-Leap-....iso"
Nov 04 08:52:21 openqaworker13 openqa-worker-cacheservice-minion[22356]: [22048] [i] Slow SQLite query: "select count(*) from minion_jobs as j
Nov 04 08:52:21 openqaworker13 openqa-worker-cacheservice-minion[22356]:     where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now'))" -> 89355ms
Nov 05 03:23:39 openqaworker13 openqa-worker-cacheservice-minion[22356]: [26949] [i] Slow SQLite query: "select count(*) from minion_jobs as j
Nov 05 03:23:39 openqaworker13 openqa-worker-cacheservice-minion[22356]:     where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now'))" -> 83326ms

It really happens a lot now on the machine. And i'm starting to see a few patterns emerge here. Almost all slow queries are actually from Minion::Backend::SQLite, not our cache service model. And BEGIN IMMEDIATE TRANSACTION originates from the Minion manager process. That leads me to the conclusion that this is an upstream performance issue in Minion::Backend::SQLite and can probably be fixed by profiling its queries to identify the bottlenecks.

Actions #17

Updated by kraih over 2 years ago

  • Status changed from Feedback to In Progress
Actions #18

Updated by kraih over 2 years ago

Also looked through the journals of most of our other OSD workers and didn't find any slow query cases yet. That seems a bit odd.

Actions #19

Updated by kraih over 2 years ago

Forgot to mention, thanks to the slow query logging we do now know that it is the cache service REST API service that makes slow queries that block the Minion jobs.

Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: [26681] [i] Slow SQLite query: "select count(case when state = 'inactive' and (expires is null or expires > datetime('now'))
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:         then 1 end) as inactive_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       count(case state when 'active' then 1 end) as active_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       count(case state when 'failed' then 1 end) as failed_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       count(case state when 'finished' then 1 end) as finished_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       count(case when state = 'inactive' and delayed > datetime('now')
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:         then 1 end) as delayed_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       (select count(*) from minion_locks where expires > datetime('now'))
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:         as active_locks,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       count(distinct case when state = 'active' then worker end)
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:         as active_workers,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       ifnull((select seq from sqlite_sequence where name = 'minion_jobs'), 0)
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:         as enqueued_jobs,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       (select count(*) from minion_workers) as inactive_workers, null as uptime
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:       from minion_jobs" -> 230715ms
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: [26681] [i] Slow SQLite query: "select id, args, attempts,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        (select json_group_array(distinct child.id)
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:          from minion_jobs as child, json_each(child.parents) as parent_id
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:          where j.id = parent_id.value) as children,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',created) as created, strftime('%s',delayed) as delayed,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',expires) as expires, strftime('%s',finished) as finished,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        lax, notes, parents, priority, queue, result,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',retried) as retried, retries,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',started) as started, state, task,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s','now') as time, worker
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        from minion_jobs as j
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now')) order by id desc limit ? offset ?" -> 96147ms
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]: [26681] [i] Slow SQLite query: "select id, args, attempts,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        (select json_group_array(distinct child.id)
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:          from minion_jobs as child, json_each(child.parents) as parent_id
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:          where j.id = parent_id.value) as children,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',created) as created, strftime('%s',delayed) as delayed,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',expires) as expires, strftime('%s',finished) as finished,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        lax, notes, parents, priority, queue, result,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',retried) as retried, retries,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s',started) as started, state, task,
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        strftime('%s','now') as time, worker
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        from minion_jobs as j
Nov 07 01:36:24 openqaworker13 openqa-workercache-daemon[25535]:        where id in (?) and (state != 'inactive' or expires is null or expires > datetime('now')) order by id desc limit ? offset ?" -> 69435ms

The frequency of these is very very high.

Actions #20

Updated by kraih over 2 years ago

Since an automatic restart of the machine on Sunday the problem appears to have vanished however. It does not look like anything significant was updated that day, just some libvirt packages. In our Grafana metrics the only noteworthy change is that the load avg of the machine went down significantly since the restart, not going above 10 where it could previously reach 200+.

Actions #21

Updated by kraih over 2 years ago

In almost all cases the responsible Minion::Backend::SQLite methods are list_jobs and stats. Which might be interesting if this problem appears again, because stats should only be called for monitoring. That means we could reduce the frequency of telegraf requests to the REST API service to mitigate the problem.

Actions #22

Updated by kraih over 2 years ago

And the queries are unoptimised, not even stats uses an index.

sqlite> explain query plan select count(case when state = 'inactive' and (expires is null or expires > datetime('now')) then 1 end) as inactive_jobs, count(case state when 'active' then 1 end) as active_jobs, count(case state when 'failed' then 1 end) as failed_jobs, count(case state when 'finished' then 1 end) as finished_jobs, count(case when state = 'inactive' and delayed > datetime('now') then 1 end) as delayed_jobs, (select count(*) from minion_locks where expires > datetime('now')) as active_locks, count(distinct case when state = 'active' then worker end) as active_workers, ifnull((select seq from sqlite_sequence where name = 'minion_jobs'), 0) as enqueued_jobs, (select count(*) from minion_workers) as inactive_workers, null as uptime from minion_jobs;
QUERY PLAN
|--USE TEMP B-TREE FOR count(DISTINCT)
|--SCAN minion_jobs
|--SCALAR SUBQUERY 1
|  `--SCAN minion_locks
|--SCALAR SUBQUERY 2
|  `--SCAN sqlite_sequence
`--SCALAR SUBQUERY 3
   `--SCAN minion_workers
Actions #23

Updated by kraih over 2 years ago

Reported the performance bottleneck upstream too. https://github.com/Grinnz/Minion-Backend-SQLite/issues/18

Actions #24

Updated by livdywan over 2 years ago

  • Due date changed from 2021-11-09 to 2021-11-12

I'm assuming this wasn't resolved yesterday, hence bumping the due date ;-)

Nice find on the upstream issue. Are you planning to fix it upstream then?

Actions #25

Updated by kraih over 2 years ago

  • Priority changed from High to Low
Actions #26

Updated by kraih over 2 years ago

Lowering priority because the machine has been fine since Sunday.

Actions #27

Updated by kraih over 2 years ago

cdywan wrote:

Nice find on the upstream issue. Are you planning to fix it upstream then?

I've opened a PR for that. https://github.com/Grinnz/Minion-Backend-SQLite/pull/19

Actions #28

Updated by kraih over 2 years ago

Looking at the load average over the past week, it seems the machine is back to normal.

Actions #29

Updated by kraih over 2 years ago

  • Status changed from In Progress to Feedback
Actions #30

Updated by kraih over 2 years ago

  • Status changed from Feedback to Resolved

No changes over the weekend, still everything normal. Guess we can consider this issue resolved. The upstream patch for Minion::Backend::SQLite has also been merged and should be in production soon.

Actions

Also available in: Atom PDF