action #107017

Random asset download (cache service) failures on openqaworker1

Added by favogt about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee: kraih
Category: -
Target version: -
Start date: -
Due date: -
% Done: 100%
Estimated time: -

Description

Observation

On openqaworker1, several jobs ended up incomplete due to a setup failure, e.g.:

https://openqa.opensuse.org/tests/2191909

[2022-02-16T21:47:58.251560+01:00] [debug] Found HDD_1, caching opensuse-Tumbleweed-x86_64-20220215-textmode@64bit.qcow2
[2022-02-16T21:47:59.691676+01:00] [info] Downloading opensuse-Tumbleweed-x86_64-20220215-textmode@64bit.qcow2, request #8258 sent to Cache Service
[2022-02-16T22:44:08.642013+01:00] [info] +++ worker notes +++
[2022-02-16T22:44:08.652197+01:00] [info] End time: 2022-02-16 21:44:08
[2022-02-16T22:44:08.652264+01:00] [info] Result: timeout

https://openqa.opensuse.org/tests/2193698

[2022-02-17T13:37:16.233412+01:00] [debug] Found ISO, caching openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso
[2022-02-17T13:37:40.012614+01:00] [info] Downloading openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso, request #9191 sent to Cache Service
[2022-02-17T14:33:00.965395+01:00] [info] +++ worker notes +++
[2022-02-17T14:33:00.965569+01:00] [info] End time: 2022-02-17 13:33:00
[2022-02-17T14:33:00.965644+01:00] [info] Result: timeout
[2022-02-17T14:33:00.974060+01:00] [info] Uploading autoinst-log.txt

Most likely this started when the worker count and cache sizes were increased after the NVMe deployment.

The Minion web interface shows some errors for the GNOME-Live download:

---
args:
- 2193699
- iso
- openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso
- http://openqa1-opensuse
attempts: 1
children: []
created: 2022-02-17T12:36:17Z
delayed: 2022-02-17T12:36:17Z
expires: ~
finished: 2022-02-17T12:38:45Z
id: 9188
lax: 0
notes:
  lock: openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso.http://openqa1-opensuse
parents: []
priority: 0
queue: default
result: 'Job terminated unexpectedly (exit code: 11, signal: 0)'
retried: ~
retries: 0
started: 2022-02-17T12:36:43Z
state: failed
task: cache_asset
time: 2022-02-17T13:45:42Z
worker: 18
---
args:
- 2193696
- iso
- openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso
- http://openqa1-opensuse
attempts: 1
children: []
created: 2022-02-17T12:36:56Z
delayed: 2022-02-17T12:36:56Z
expires: ~
finished: 2022-02-17T12:41:18Z
id: 9190
lax: 0
notes:
  lock: openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso.http://openqa1-opensuse
parents: []
priority: 0
queue: default
result: 'Job terminated unexpectedly (exit code: 2, signal: 0)'
retried: ~
retries: 0
started: 2022-02-17T12:38:11Z
state: failed
task: cache_asset
time: 2022-02-17T13:45:42Z
worker: 18

Might this be some miscellaneous network issue? I couldn't find any further information.


Related issues (2): 0 open, 2 closed

Is duplicate of: openQA Infrastructure - action #103524: OW1: performance loss size:M (Resolved, mkittler, 2021-12-06)

Copied from: openQA Infrastructure - action #103524: OW1: performance loss size:M (Resolved, mkittler, 2021-12-06)

Actions #1

Updated by favogt about 2 years ago

Actions #2

Updated by kraih about 2 years ago

One of those jobs exited with a segfault; that is definitely not expected behaviour and cannot be caused by pure Perl code (an XS module must have been misbehaving). Hopefully it's just a one-time occurrence; otherwise we might have to check whether any Perl-related packages have been updated recently.
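
Should it come back, one way to spot recently updated Perl-related packages would be something along these lines (just a sketch, not a command that was run here):

openqaworker1:~ # rpm -qa --last 'perl-*' | head   # newest installs/updates first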

Actions #3

Updated by kraih about 2 years ago

Looked up the jobs in question in the database:

8258|[2191909,"hdd","opensuse-Tumbleweed-x86_64-20220215-textmode@64bit.qcow2","http:\/\/openqa1-opensuse"]|2022-02-16 20:47:58|2022-02-16 20:47:58|2022-02-16 20:48:50|0|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"||0|2022-02-16 20:47:59|failed|cache_asset|10|default|1|[]|{"lock":"opensuse-Tumbleweed-x86_64-20220215-textmode@64bit.qcow2.http:\/\/openqa1-opensuse"}||0

9191|[2193698,"iso","openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso","http:\/\/openqa1-opensuse"]|2022-02-17 12:37:16|2022-02-17 12:37:16|2022-02-17 12:41:18|0|"Job terminated unexpectedly (exit code: 11, signal: 0)"||0|2022-02-17 12:38:15|failed|cache_asset|18|default|1|[]|{"lock":"openSUSE-Leap-15.3-GNOME-Live-x86_64-Build9.374-Media.iso.http:\/\/openqa1-opensuse"}||0
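
For reference, that lookup is a direct query against the cache service's Minion database, roughly like this (assuming the default cache database location):

openqaworker1:~ # sqlite3 /var/lib/openqa/cache/cache.sqlite \
    "SELECT id, result, state, finished FROM minion_jobs WHERE id IN (8258, 9191);"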

It does look like the machine is having issues, and it's not the network.

Actions #4

Updated by kraih about 2 years ago

Might be SQLite-related, maybe a storage issue? There are quite a few more similar cases:

5728|2022-02-14 21:32:56|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
6945|2022-02-16 00:43:19|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
6988|2022-02-16 01:12:39|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8234|2022-02-16 20:21:07|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8272|2022-02-16 21:18:22|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8289|2022-02-16 21:30:35|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
8291|2022-02-16 21:30:22|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8293|2022-02-16 21:35:39|failed|"Worker went away"
8428|2022-02-16 22:51:41|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
8665|2022-02-17 01:02:37|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
8789|2022-02-17 02:30:15|failed|"Job terminated unexpectedly (exit code: 0, signal: 15)"
9343|2022-02-17 14:37:14|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
3272|2022-02-13 02:30:13|failed|"Job terminated unexpectedly (exit code: 0, signal: 15)"
5444|2022-02-14 19:09:07|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
5445|2022-02-14 19:09:10|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
5446|2022-02-14 19:09:10|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
6319|2022-02-15 18:46:35|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
6557|2022-02-15 21:17:52|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
6572|2022-02-15 21:27:09|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
6953|2022-02-16 00:49:09|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
7026|2022-02-16 01:37:05|failed|"Job terminated unexpectedly (exit code: 11, signal: 0)"
7435|2022-02-16 05:51:25|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
7835|2022-02-16 14:45:41|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
7941|2022-02-16 16:06:13|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
8243|2022-02-16 20:29:20|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8247|2022-02-16 20:35:51|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8258|2022-02-16 20:48:50|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8371|2022-02-16 22:06:09|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8388|2022-02-16 22:11:38|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8405|2022-02-16 22:24:23|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8413|2022-02-16 22:32:29|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
8420|2022-02-16 22:37:58|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8421|2022-02-16 22:39:30|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8425|2022-02-16 22:45:08|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8507|2022-02-16 23:34:24|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
8658|2022-02-17 00:47:46|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8667|2022-02-17 01:04:11|failed|"Job terminated unexpectedly (exit code: 11, signal: 0)"
8668|2022-02-17 01:04:15|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8670|2022-02-17 01:07:35|failed|"Job terminated unexpectedly (exit code: 11, signal: 0)"
8671|2022-02-17 01:07:16|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8676|2022-02-17 01:13:16|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8679|2022-02-17 01:19:36|failed|"Job terminated unexpectedly (exit code: 11, signal: 0)"
8681|2022-02-17 01:20:20|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
8783|2022-02-17 02:30:02|failed|"Job terminated unexpectedly (exit code: 0, signal: 15)"
8794|2022-02-17 02:30:15|failed|"Job terminated unexpectedly (exit code: 0, signal: 15)"
8840|2022-02-17 03:19:54|failed|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"
9188|2022-02-17 12:38:45|failed|"Job terminated unexpectedly (exit code: 11, signal: 0)"
9189|2022-02-17 12:41:12|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
9190|2022-02-17 12:41:18|failed|"Job terminated unexpectedly (exit code: 2, signal: 0)"
9191|2022-02-17 12:41:18|failed|"Job terminated unexpectedly (exit code: 11, signal: 0)"

The cases of "Job terminated unexpectedly" without a signal are especially concerning.
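
For context, "database is locked" is what DBD::SQLite reports when a statement cannot acquire the database lock within the busy timeout; raising the timeout would make writers wait longer instead of failing, e.g. (a sketch, the value is arbitrary):

PRAGMA busy_timeout = 10000;  -- wait up to 10 s for the lock instead of failing immediately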

Actions #5

Updated by favogt about 2 years ago

kraih wrote:

Might be SQLite related, maybe a storage issue? There's quite a few more similar cases:
...
The cases of Job terminated unexpectedly without a signal are especially concerning.

Indeed. A coredump or just a backtrace would help already... I don't see any SEGVs in dmesg though.

If it's caused by SQLite, then the procedure from https://progress.opensuse.org/issues/67000#note-31 might work here. Or just deleting the database altogether.
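
Deleting and recreating it should boil down to something like this (a sketch; service and file names assume a default cache service setup):

openqaworker1:~ # systemctl stop openqa-worker-cacheservice-minion openqa-worker-cacheservice
openqaworker1:~ # rm -f /var/lib/openqa/cache/cache.sqlite*   # the glob also catches the -wal/-shm journal files
openqaworker1:~ # systemctl start openqa-worker-cacheservice openqa-worker-cacheservice-minion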

Actions #6

Updated by kraih about 2 years ago

favogt wrote:

If it's caused by SQLite, then the procedure from https://progress.opensuse.org/issues/67000#note-31 might work here. Or just deleting the database altogether.

You might be right; file corruption would be an explanation, and deleting the SQLite file could help.

Actions #7

Updated by kraih about 2 years ago

  • Assignee set to kraih
Actions #8

Updated by kraih about 2 years ago

I've recreated the cache database; let's see what happens.

Actions #9

Updated by kraih about 2 years ago

Something is still very wrong; I can't even check the journal mode because the database is constantly locked:

SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked
sqlite> PRAGMA journal_mode;
Error: database is locked

That should mean very slow disk I/O.
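
For what it's worth, setting a busy timeout in the sqlite3 shell first would at least make the PRAGMA wait for the lock instead of failing immediately; a sketch:

sqlite> PRAGMA busy_timeout = 5000;
sqlite> PRAGMA journal_mode;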

Actions #10

Updated by nicksinger about 2 years ago

Just to clarify, we're talking about data residing below /var/lib/openqa, right?

We have this data and the root fs both on a RAID 1, so I'd assume more widespread damage if the hardware were broken. But we have to check. At first glance, everything on the system looks (almost) fine:

openqaworker1:~ # mdadm --detail /dev/md127
/dev/md127:
           Version : 1.2
     Creation Time : Wed Feb  9 16:22:20 2022
        Raid Level : raid0
        Array Size : 1390651392 (1326.23 GiB 1424.03 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Wed Feb  9 16:22:20 2022
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

            Layout : original
        Chunk Size : 512K

Consistency Policy : none

              Name : openqaworker1:openqa  (local to host openqaworker1)
              UUID : 4030b549:812c9e8a:22ad7ec4:d4478a72
            Events : 0

    Number   Major   Minor   RaidDevice State
       0     259        1        0      active sync   /dev/nvme0n1
       1     259        0        1      active sync   /dev/nvme1n1
openqaworker1:~ # smartctl -a /dev/nvme0n1
smartctl 7.0 2019-05-21 r4917 [x86_64-linux-5.3.18-150300.59.49-default] (SUSE RPM)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPE2ME400G4
Serial Number:                      CVMD4352001W400FGN
Firmware Version:                   8DV10151
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Controller ID:                      0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          400.088.457.216 [400 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Thu Feb 17 17:24:31 2022 CET
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x0006):     Wr_Unc DS_Mngmt
Maximum Data Transfer Size:         32 Pages

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W       -        -    0  0  0  0        0       0

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -     512       8         2
 2 -     512      16         2
 3 -    4096       0         0
 4 -    4096       8         0
 5 -    4096      64         0
 6 -    4096     128         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        29 Celsius
Available Spare:                    99%
Available Spare Threshold:          10%
Percentage Used:                    40%
Data Units Read:                    1.312.831.523 [672 TB]
Data Units Written:                 2.954.476.718 [1,51 PB]
Host Read Commands:                 8.937.663.770
Host Write Commands:                15.006.645.683
Controller Busy Time:               12.752
Power Cycles:                       103
Power On Hours:                     56.986
Unsafe Shutdowns:                   22
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0          1    13  0x0000  0x4502  0x000            0     1     -

openqaworker1:~ # smartctl -a /dev/nvme1n1
smartctl 7.0 2019-05-21 r4917 [x86_64-linux-5.3.18-150300.59.49-default] (SUSE RPM)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       INTEL SSDPEKNW010T8
Serial Number:                      BTNH117605HA1P0B
Firmware Version:                   004C
PCI Vendor/Subsystem ID:            0x8086
IEEE OUI Identifier:                0x5cd2e4
Controller ID:                      1
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1.024.209.543.168 [1,02 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Thu Feb 17 17:24:33 2022 CET
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         32 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     80 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.00W       -        -    0  0  0  0        0       0
 1 +     3.00W       -        -    1  1  1  1        0       0
 2 +     2.20W       -        -    2  2  2  2        0       0
 3 -   0.0300W       -        -    3  3  3  3     5000    5000
 4 -   0.0040W       -        -    4  4  4  4     5000    9000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        53 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Read:                    12.041.103 [6,16 TB]
Data Units Written:                 28.202.638 [14,4 TB]
Host Read Commands:                 82.636.540
Host Write Commands:                137.849.821
Controller Busy Time:               4.807
Power Cycles:                       1
Power On Hours:                     226
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, max 256 entries)
No Errors Logged

/dev/nvme0n1 has one error logged, so it might be worth investigating further.

Actions #11

Updated by nicksinger about 2 years ago

openqaworker1:~ # nvme error-log /dev/nvme0n1
Error Log Entries for device:nvme0n1 entries:64
.................
 Entry[ 0]
.................
error_count : 1
sqid        : 13
cmdid       : 0
status_field    : 0x4502(READ_ERROR: The read data could not be recovered from the media)
parm_err_loc    : 0
lba     : 0
nsid        : 0x1
vs      : 0
trtype      : The transport type is not indicated or the error is not transport related.
cs      : 0
trtype_spec_info: 0
.................
Actions #12

Updated by kraih about 2 years ago

I've masked all 20 workers and restarted the machine. I tried to run hdparm --direct -t -T /dev/md127, but even after 10 minutes there was no output, and the process wouldn't react to a kill -9.
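
Masking all worker slots was probably a one-liner along these lines (assuming the usual openqa-worker@ unit template):

openqaworker1:~ # systemctl mask --now openqa-worker@{1..20}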

Actions #13

Updated by kraih about 2 years ago

After the reboot I'm seeing some results that look OK:

$ hdparm  --direct -t -T /dev/md127

/dev/md127:
 Timing O_DIRECT cached reads:   1582 MB in  2.00 seconds = 790.47 MB/sec
 Timing O_DIRECT disk reads: 4572 MB in  3.00 seconds = 1523.69 MB/sec

I'll bring the first two workers online again to collect some data.
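
Bringing the first two slots back would be the reverse of the masking, e.g.:

openqaworker1:~ # systemctl unmask openqa-worker@{1,2}
openqaworker1:~ # systemctl start openqa-worker@{1,2}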

Actions #14

Updated by favogt about 2 years ago

That means the issue from https://progress.opensuse.org/issues/103524 is back or still there, even after the replacement of nvme1n1.

The issue is that performance tanks at a random point during use, and flushes (triggered by e.g. QEMU and hdparm) take several minutes.
Read access is also impacted:

openqaworker1:~ # dd if=/dev/nvme1n1 bs=1M count=1000 status=progress iflag=direct of=/dev/null
1033895936 bytes (1.0 GB, 986 MiB) copied, 37 s, 27.9 MB/s
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 37.5085 s, 28.0 MB/s

In contrast, /dev/nvme0n1:

openqaworker1:~ # dd if=/dev/nvme0n1 bs=1M count=1000 status=progress iflag=direct of=/dev/null
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.877994 s, 1.2 GB/s

I tried the latest kernel from Kernel:stable:Backport to see whether it would make a difference, but it was still slow after a reboot.

However, it was resolved instantly after running fstrim:

openqaworker1:~ # fstrim -v /var/lib/openqa
/var/lib/openqa: 1.2 TiB (1312675467264 bytes) trimmed
openqaworker1:~ # dd if=/dev/nvme1n1 bs=1M count=1000 status=progress iflag=direct of=/dev/null
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.743051 s, 1.4 GB/s

A daily fstrim.timer might help; I'll enable that.
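
Enabling it should just be:

openqaworker1:~ # systemctl enable --now fstrim.timer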

Actions #15

Updated by kraih about 2 years ago

  • Assignee deleted (kraih)
Actions #16

Updated by favogt about 2 years ago

Actions #17

Updated by favogt about 2 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

So just https://progress.opensuse.org/issues/103524 again, with slightly different symptoms on top.

Actions #18

Updated by kraih about 2 years ago

  • Status changed from Closed to New
  • % Done changed from 100 to 0

The SQLite issues are also back this morning:

461|[2195408,"other","Tumbleweed.x86_64-1.0-libvirt-Snapshot20220216.vagrant.libvirt.box","http:\/\/openqa1-opensuse"]|2022-02-18 09:41:14|2022-02-18 09:41:14|2022-02-18 09:41:53|0|"DBD::SQLite::db do failed: database is locked at \/usr\/share\/openqa\/script\/..\/lib\/OpenQA\/CacheService.pm line 77.\n"||0|2022-02-18 09:41:17|failed|cache_asset|4|default|1|[]|{"lock":"Tumbleweed.x86_64-1.0-libvirt-Snapshot20220216.vagrant.libvirt.box.http:\/\/openqa1-opensuse"}||0

Disk I/O gets so slow that SQLite can't keep up with journal commits anymore. So I can confirm that this is not a cache-service-specific issue.
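
A quick way to see that from the commit side would be timing small synchronous writes on the same filesystem, something like this (the test file path is made up):

openqaworker1:~ # dd if=/dev/zero of=/var/lib/openqa/ddtest bs=4k count=1000 oflag=direct,dsync
openqaworker1:~ # rm /var/lib/openqa/ddtest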

Actions #19

Updated by kraih about 2 years ago

  • % Done changed from 0 to 100
Actions #20

Updated by kraih about 2 years ago

  • Status changed from New to Closed
  • Assignee set to kraih
Actions #21

Updated by kraih about 2 years ago

  • Status changed from Closed to Resolved