action #107017

Random asset download (cache service) failures on openqaworker1

Added by favogt about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee: kraih
Category: -
Target version: -
Start date: -
Due date: -
% Done: 100%
Estimated time: -

Description

Observation

On openqaworker1, several jobs ended up incomplete due to a setup failure, e.g.:

https://openqa.opensuse.org/tests/2191909

[2022-02-16T21:47:58.251560+01:00] [debug] Found HDD_1, caching opensuse-Tumbleweed-x86_64-20220215-textmode@64bit.qcow2
[2022-02-16T21:47:59.691676+01:00] [info] Downloading opensuse-Tumbleweed-x86_64-20220215-textmode@64bit.qcow2, request #8258 sent to Cache Service
[2022-02-16T22:44:08.642013+01:00] [info] +++ worker notes +++
[2022-02-16T22:44:08.652197+01:00] [info] End time: 2022-02-16 21:44:08
[2022-02-16T22:44:08.652264+01:00] [info] Result: timeout

https://openqa.opensuse.org/tests/2193698

[2022-02-17T13:37:16.233412+01:00] [debug] Found ISO, caching openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso
[2022-02-17T13:37:40.012614+01:00] [info] Downloading openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso, request #9191 sent to Cache Service
[2022-02-17T14:33:00.965395+01:00] [info] +++ worker notes +++
[2022-02-17T14:33:00.965569+01:00] [info] End time: 2022-02-17 13:33:00
[2022-02-17T14:33:00.965644+01:00] [info] Result: timeout
[2022-02-17T14:33:00.974060+01:00] [info] Uploading autoinst-log.txt

Most likely this started when the worker count and cache sizes were increased after the NVMe deployment.

The Minion web interface shows some errors for the GNOME-Live download:

---
args:
- 2193699
- iso
- openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso
- http://openqa1-opensuse
attempts: 1
children: []
created: 2022-02-17T12:36:17Z
delayed: 2022-02-17T12:36:17Z
expires: ~
finished: 2022-02-17T12:38:45Z
id: 9188
lax: 0
notes:
  lock: openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso.http://openqa1-opensuse
parents: []
priority: 0
queue: default
result: 'Job terminated unexpectedly (exit code: 11, signal: 0)'
retried: ~
retries: 0
started: 2022-02-17T12:36:43Z
state: failed
task: cache_asset
time: 2022-02-17T13:45:42Z
worker: 18
---
args:
- 2193696
- iso
- openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso
- http://openqa1-opensuse
attempts: 1
children: []
created: 2022-02-17T12:36:56Z
delayed: 2022-02-17T12:36:56Z
expires: ~
finished: 2022-02-17T12:41:18Z
id: 9190
lax: 0
notes:
  lock: openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso.http://openqa1-opensuse
parents: []
priority: 0
queue: default
result: 'Job terminated unexpectedly (exit code: 2, signal: 0)'
retried: ~
retries: 0
started: 2022-02-17T12:38:11Z
state: failed
task: cache_asset
time: 2022-02-17T13:45:42Z
worker: 18

Might this be some miscellaneous network issue? I couldn't find any further information.


Related issues (2): 0 open, 2 closed

Is duplicate of: openQA Infrastructure - action #103524: OW1: performance loss size:M (Resolved, mkittler, 2021-12-06)

Copied from: openQA Infrastructure - action #103524: OW1: performance loss size:M (Resolved, mkittler, 2021-12-06)

Actions #1

Updated by favogt about 2 years ago

Actions #2

Updated by kraih about 2 years ago

One of those jobs exited with a segfault; that is definitely not expected behaviour and cannot be caused by pure Perl code (an XS module must have been misbehaving). Hopefully it's just a one-time occurrence; otherwise we might have to check whether any Perl-related packages have been updated recently.
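
Should it come back, one way to spot recently updated Perl-related packages would be something along these lines (just a sketch, not a command that was run here):

openqaworker1:~ # rpm -qa --last 'perl-*' | head   # newest installs/updates first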

Actions #3

Updated by kraih about 2 years ago

Looked up the jobs in question in the database:

8258|[2191909,"hdd","opensuse-Tumbleweed-x86_64-20220215-textmode@64bit.qcow2","http:\/\/openqa1-opensuse"]|2022-02-16 20:47:58|2022-02-16 20:47:58|2022-02-16 20:48:50|0|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"||0|2022-02-16 20:47:59|failed|cache_asset|10|default|1|[]|{"lock":"opensuse-Tumbleweed-x86_64-20220215-textmode@64bit.qcow2.http:\/\/openqa1-opensuse"}||0

9191|[2193698,"iso","openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso","http:\/\/openqa1-opensuse"]|2022-02-17 12:37:16|2022-02-17 12:37:16|2022-02-17 12:41:18|0|"Job terminated unexpectedly (exit code: 11, signal: 0)"||0|2022-02-17 12:38:15|failed|cache_asset|18|default|1|[]|{"lock":"openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso.http:\/\/openqa1-opensuse"}||0
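
For reference, that lookup is a direct query against the cache service's Minion database, roughly like this (assuming the default cache database location):

openqaworker1:~ # sqlite3 /var/lib/openqa/cache/cache.sqlite \
    "SELECT id, result, state, finished FROM minion_jobs WHERE id IN (8258, 9191);"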

It does look like the machine is having issues, and it's not the network.

Actions #4

Updated by kraih about 2 years ago

Might be SQLite-related, maybe a storage issue? There are quite a few more similar cases:

5728|2022-02-14 21:32:56|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
6945|2022-02-16 00:43:19|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
6988|2022-02-16 01:12:39|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8234|2022-02-16 20:21:07|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8272|2022-02-16 21:18:22|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8289|2022-02-16 21:30:35|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
8291|2022-02-16 21:30:22|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8293|2022-02-16 21:35:39|failed|"Worker went away"
8428|2022-02-16 22:51:41|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
8665|2022-02-17 01:02:37|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
8789|2022-02-17 02:30:15|failed|"Job terminated unexpectedly (exit code: 0, signal: 15)"
9343|2022-02-17 14:37:14|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
3272|2022-02-13 02:30:13|failed|"Job terminated unexpectedly (exit code: 0, signal: 15)"
5444|2022-02-14 19:09:07|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
5445|2022-02-14 19:09:10|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
5446|2022-02-14 19:09:10|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
6319|2022-02-15 18:46:35|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
6557|2022-02-15 21:17:52|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
6572|2022-02-15 21:27:09|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
6953|2022-02-16 00:49:09|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
7026|2022-02-16 01:37:05|failed|"Job terminated unexpectedly (exit code: 11, signal: 0)"
7435|2022-02-16 05:51:25|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
7835|2022-02-16 14:45:41|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
7941|2022-02-16 16:06:13|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
8243|2022-02-16 20:29:20|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8247|2022-02-16 20:35:51|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8258|2022-02-16 20:48:50|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8371|2022-02-16 22:06:09|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8388|2022-02-16 22:11:38|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8405|2022-02-16 22:24:23|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8413|2022-02-16 22:32:29|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
8420|2022-02-16 22:37:58|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8421|2022-02-16 22:39:30|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8425|2022-02-16 22:45:08|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8507|2022-02-16 23:34:24|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
8658|2022-02-17 00:47:46|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8667|2022-02-17 01:04:11|failed|"Job terminated unexpectedly (exit code: 11, signal: 0)"
8668|2022-02-17 01:04:15|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8670|2022-02-17 01:07:35|failed|"Job terminated unexpectedly (exit code: 11, signal: 0)"
8671|2022-02-17 01:07:16|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8676|2022-02-17 01:13:16|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8679|2022-02-17 01:19:36|failed|"Job terminated unexpectedly (exit code: 11, signal: 0)"
8681|2022-02-17 01:20:20|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8783|2022-02-17 02:30:02|failed|"Job terminated unexpectedly (exit code: 0, signal: 15)"
8794|2022-02-17 02:30:15|failed|"Job terminated unexpectedly (exit code: 0, signal: 15)"
8840|2022-02-17 03:19:54|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
9188|2022-02-17 12:38:45|failed|"Job terminated unexpectedly (exit code: 11, signal: 0)"
9189|2022-02-17 12:41:12|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
9190|2022-02-17 12:41:18|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
9191|2022-02-17 12:41:18|failed|"Job terminated unexpectedly (exit code: 11, signal: 0)"

The cases of "Job terminated unexpectedly" without a signal are especially concerning.
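
For context, "database is locked" is what DBD::SQLite reports when a statement cannot acquire the database lock within the busy timeout; raising the timeout would make writers wait longer instead of failing, e.g. (a sketch, the value is arbitrary):

PRAGMA busy_timeout = 10000;  -- wait up to 10 s for the lock instead of failing immediately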

Actions #5

Updated by favogt about 2 years ago

kraih wrote:

Might be SQLite related, maybe a storage issue? There's quite a few more similar cases:
...
The cases of Job terminated unexpectedly without a signal are especially concerning.

Indeed. A coredump or just a backtrace would help already... I don't see any SEGVs in dmesg though.

If it's caused by SQLite, then the procedure from https://progress.opensuse.org/issues/67000#note-31 might work here. Or just deleting the database altogether.
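
Deleting and recreating it should boil down to something like this (a sketch; service and file names assume a default cache service setup):

openqaworker1:~ # systemctl stop openqa-worker-cacheservice-minion openqa-worker-cacheservice
openqaworker1:~ # rm -f /var/lib/openqa/cache/cache.sqlite*   # the glob also catches the -wal/-shm journal files
openqaworker1:~ # systemctl start openqa-worker-cacheservice openqa-worker-cacheservice-minion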

Actions #6

Updated by kraih about 2 years ago

favogt wrote:

If it's caused by SQLite, then the procedure from https://progress.opensuse.org/issues/67000#note-31 might work here. Or just deleting the database altogether.

You might be right; file corruption would be an explanation, and deleting the SQLite file could help.

Actions #7

Updated by kraih about 2 years ago

  • Assignee set to kraih
Actions #8

Updated by kraih about 2 years ago

I've recreated the cache database; let's see what happens.

Actions #9

Updated by kraih about 2 years ago

Something is still very wrong; I can't even check the journal mode because the database is constantly locked:

SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked

That should mean very slow disk I/O.
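
For what it's worth, setting a busy timeout in the sqlite3 shell first would at least make the PRAGMA wait for the lock instead of failing immediately; a sketch:

sqlite> PRAGMA busy_timeout = 5000;
sqlite> PRAGMA journal_mode;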

Actions #10

Updated by nicksinger about 2 years ago

Just to clarify, we're talking about data residing below /var/lib/openqa, right?

We have this data and the root fs both on a RAID 1, so I'd assume more widespread damage if the hardware were broken. But we have to check. At first glance, everything on the system looks (almost) fine:

openqaworker1:~ # mdadm --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Wed Feb  9 16:22:20 2022
        Raid Level : raid0
        Array Size : 1390651392 (1326.23 GiB 1424.03 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Wed Feb  9 16:22:20 2022
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

            Layout : original
        Chunk Size : 512K

Consistency Policy : none

              Name : openqaworker1:openqa  (local to host openqaworker1)
              UUID : 4030b549:812c9e8a:22ad7ec4:d4478a72
            Events : 0

    Number   Major   Minor   RaidDevice State
       0     259        1        0      active sync   /dev/nvme0n1
       1     259        0        1      active sync   /dev/nvme1n1
openqaworker1:~ # smartctl -a /dev/nvme0n1
smartctl 7.0 2019-05-21 r4917 [x86_64-linux-5.3.18-150300.59.49-default] (SUSE RPM)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPE2ME400G4
Serial Number:                      CVMD4352001W400FGN
Firmware Version:                   8DV10151
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          400.088.457.216 [400 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Thu Feb 17 17:24:31 2022 CET
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x0006):     Wr_Unc DS_Mngmt
Maximum Data Transfer Size:         32 Pages

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -     512       8         2
 2 -     512      16         2
 3 -    4096       0         0
 4 -    4096       8         0
 5 -    4096      64         0
 6 -    4096     128         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        29 Celsius
Available Spare:                    99%
Available Spare Threshold:          10%
Percentage Used:                    40%
Data Units Read:                    1.312.831.523 [672 TB]
Data Units Written:                 2.954.476.718 [1,51 PB]
Host Read Commands:                 8.937.663.770
Host Write Commands:                15.006.645.683
Controller Busy Time:               12.752
Power Cycles:                       103
Power On Hours:                     56.986
Unsafe Shutdowns:                   22
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          1    13  0x0000  0x4502  0x000            0     1     -

openqaworker1:~ # smartctl -a /dev/nvme1n1
smartctl 7.0 2019-05-21 r4917 [x86_64-linux-5.3.18-150300.59.49-default] (SUSE RPM)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPEKNW010T8
Serial Number:                      BTNH117605HA1P0B
Firmware Version:                   004C
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1.024.209.543.168 [1,02 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Thu Feb 17 17:24:33 2022 CET
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.00W       -        -    0  0  0  0        0       0
 1 +     3.00W       -        -    1  1  1  1        0       0
 2 +     2.20W       -        -    2  2  2  2        0       0
 3 -   0.0300W       -        -    3  3  3  3     5000    5000
 4 -   0.0040W       -        -    4  4  4  4     5000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        53 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Read:                    12.041.103 [6,16 TB]
Data Units Written:                 28.202.638 [14,4 TB]
Host Read Commands:                 82.636.540
Host Write Commands:                137.849.821
Controller Busy Time:               4.807
Power Cycles:                       1
Power On Hours:                     226
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged

/dev/nvme0n1 has one error logged, so it might be worth investigating further.

Actions #11

Updated by nicksinger about 2 years ago

openqaworker1:~ # nvme error-log /dev/nvme0n1
Error Log Entries for device:nvme0n1 entries:64
.................
 Entry[ 0]
.................
error_count : 1
sqid        : 13
cmdid       : 0
status_field    : 0x4502(READ_ERROR: The read data could not be recovered from the media)
parm_err_loc    : 0
lba     : 0
nsid        : 0x1
vs      : 0
trtype      : The transport type is not indicated or the error is not transport related.
cs      : 0
trtype_spec_info: 0
.................
Actions #12

Updated by kraih about 2 years ago

I've masked all 20 workers and restarted the machine. I tried to run hdparm --direct -t -T /dev/md127, but even after 10 minutes there was no output, and the process wouldn't react to a kill -9.
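
Masking all worker slots was probably a one-liner along these lines (assuming the usual openqa-worker@ unit template):

openqaworker1:~ # systemctl mask --now openqa-worker@{1..20}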

Actions #13

Updated by kraih about 2 years ago

After the reboot I'm seeing some results that look OK:

$ hdparm  --direct -t -T /dev/md127

/dev/md127:
 Timing O_DIRECT cached reads:   1582 MB in  2.00 seconds = 790.47 MB/sec
 Timing O_DIRECT disk reads: 4572 MB in  3.00 seconds = 1523.69 MB/sec

I'll bring the first two workers online again to collect some data.
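
Bringing the first two slots back would be the reverse of the masking, e.g.:

openqaworker1:~ # systemctl unmask openqa-worker@{1,2}
openqaworker1:~ # systemctl start openqa-worker@{1,2}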

Actions #14

Updated by favogt about 2 years ago

That means the issue from https://progress.opensuse.org/issues/103524 is back or still there, even after the replacement of nvme1n1.

The issue is that performance tanks at a random point during use, and flushes (triggered by e.g. QEMU and hdparm) take several minutes.
Read access is also impacted:

openqaworker1:~ # dd if=/dev/nvme1n1 bs=1M count=1000 status=progress iflag=direct of=/dev/null
1033895936 bytes (1.0 GB, 986 MiB) copied, 37 s, 27.9 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 37.5085 s, 28.0 MB/s

In contrast, /dev/nvme0n1:

openqaworker1:~ # dd if=/dev/nvme0n1 bs=1M count=1000 status=progress iflag=direct of=/dev/null
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.877994 s, 1.2 GB/s

I tried the latest kernel from Kernel:stable:Backport to see whether it would make a difference, but it was still slow after a reboot.

However, it was resolved instantly after running fstrim:

openqaworker1:~ # fstrim -v /var/lib/openqa
/var/lib/openqa: 1.2 TiB (1312675467264 bytes) trimmed
openqaworker1:~ # dd if=/dev/nvme1n1 bs=1M count=1000 status=progress iflag=direct of=/dev/null
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.743051 s, 1.4 GB/s

A daily fstrim.timer might help; I'll enable that.
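
Enabling it should just be:

openqaworker1:~ # systemctl enable --now fstrim.timer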

Actions #15

Updated by kraih about 2 years ago

  • Assignee deleted (kraih)
Actions #16

Updated by favogt about 2 years ago

Actions #17

Updated by favogt about 2 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

So just https://progress.opensuse.org/issues/103524 again, with slightly different symptoms on top.

Actions #18

Updated by kraih about 2 years ago

  • Status changed from Closed to New
  • % Done changed from 100 to 0

The SQLite issues are also back this morning:

461|[2195408,"other","Tumbleweed.x86_64-1.0-libvirt-Snapshot20220216.vagrant.libvirt.box","http:\/\/openqa1-opensuse"]|2022-02-18 09:41:14|2022-02-18 09:41:14|2022-02-18 09:41:53|0|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"||0|2022-02-18 09:41:17|failed|cache_asset|4|default|1|[]|{"lock":"Tumbleweed.x86_64-1.0-libvirt-Snapshot20220216.vagrant.libvirt.box.http:\/\/openqa1-opensuse"}||0

Disk I/O gets so slow that SQLite can't keep up with journal commits anymore. So I can confirm that this is not a cache-service-specific issue.
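
A quick way to see that from the commit side would be timing small synchronous writes on the same filesystem, something like this (the test file path is made up):

openqaworker1:~ # dd if=/dev/zero of=/var/lib/openqa/ddtest bs=4k count=1000 oflag=direct,dsync
openqaworker1:~ # rm /var/lib/openqa/ddtest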

Actions #19

Updated by kraih about 2 years ago

  • % Done changed from 0 to 100
Actions #20

Updated by kraih about 2 years ago

  • Status changed from New to Closed
  • Assignee set to kraih
Actions #21

Updated by kraih about 2 years ago

  • Status changed from Closed to Resolved