action #155716

closed

[alert] openqa-worker-cacheservice fails to start on worker29.oqa.prg2.suse.org with "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S

Added by okurz 2 months ago. Updated 2 months ago.

Status: Resolved
Priority: Urgent
Assignee: mkittler
Category: Regressions/Crashes
Target version:
Start date: 2024-02-21
Due date: 2024-03-07
% Done: 0%
Estimated time:

Description

Observation

From https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

ssh worker29.oqa.prg2.suse.org "journalctl -u openqa-worker-cacheservice" says

Feb 21 09:25:43 worker29 openqa-workercache-daemon[86009]: [86009] [e] Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error at /u>
Feb 21 09:25:43 worker29 openqa-workercache-daemon[86009]: [86009] [e] Killing processes accessing the database file handles and removing database
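
To confirm the alert manually, the unit state can be checked the same way (plain systemd commands, nothing openQA-specific):

ssh worker29.oqa.prg2.suse.org "systemctl status openqa-worker-cacheservice"
ssh worker29.oqa.prg2.suse.org "systemctl is-failed openqa-worker-cacheservice"   # exits 0 while the unit is in the failed state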

Acceptance criteria

  • AC1: Cache service on worker29 works again

Suggestions

  • DONE Add silence(s)
  • Gather logs helpful for debugging, especially before the machine is rebooted (see the sketch below this list)
  • Maybe ext2 is just unreliable -> yes, it is. A reboot of the machine already fixed the problem because we recreate the filesystem automatically
  • Create another ticket for the related fallout of the reboot-triggered problem
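
A minimal sketch of what could be collected before a reboot discards the state (the command selection is an assumption, adjust paths as needed):

sudo journalctl -u openqa-worker-cacheservice --no-pager > /tmp/cacheservice.log   # full service log
sudo dmesg -T > /tmp/dmesg.log                                                     # kernel messages, look for I/O errors
cat /proc/mdstat > /tmp/mdstat.txt                                                 # state of the RAID device backing /var/lib/openqa
sudo smartctl -a /dev/nvme0n1 > /tmp/smart.txt    # repeat per RAID member; the device name here is only a placeholder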

Rollback actions

Out of scope

  • Using another filesystem

Related issues 2 (1 open, 1 closed)

Related to openQA Infrastructure - action #155737: Salt pillars pipelines fail due to refused connection errors on telegraf (status: Rejected, assignee: okurz)

Copied to openQA Infrastructure - action #155764: Consider switching to safer filesystems than ext2 in osd+o3 (status: New)
Actions #1

Updated by okurz 2 months ago

  • Description updated (diff)
Actions #2

Updated by okurz 2 months ago

  • Description updated (diff)
Actions #3

Updated by mkittler 2 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #4

Updated by mkittler 2 months ago · Edited

Definitely not an issue with our code:

martchus@worker29:~> sudo -u _openqa-worker sqlite3 /var/lib/openqa/cache/cache.sqlite
SQLite version 3.44.0 2023-11-01 11:23:50
Enter ".help" for usage hints.            
sqlite> select * from assets;
Parse error: no such table: assets
sqlite> create table foo ( `bar` text );
Runtime error: disk I/O error (10)

Same on a completely fresh database file:

martchus@worker29:~> sudo -u _openqa-worker mv /var/lib/openqa/cache/cache.sqlite{,.bak}
martchus@worker29:~> sudo -u _openqa-worker sqlite3 /var/lib/openqa/cache/cache.sqlite
SQLite version 3.44.0 2023-11-01 11:23:50
Enter ".help" for usage hints.
sqlite> create table foo ( `bar` text );
Runtime error: disk I/O error (10)
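
Error code 10 is SQLITE_IOERR, i.e. the failure is reported by the layer below SQLite, which fits the next finding. An integrity check of the saved copy would point the same way (sketch, using the .bak file from above):

sudo -u _openqa-worker sqlite3 /var/lib/openqa/cache/cache.sqlite.bak 'PRAGMA integrity_check;'
# prints "ok" on a healthy database; here it is expected to fail with another disk I/O error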

Looks like the filesystem was broken:

martchus@worker29:~> sudo systemctl stop var-lib-openqa.mount 
martchus@worker29:~> sudo fsck.ext2 /dev/md127
e2fsck 1.46.4 (18-Aug-2021)
/dev/md127 was not cleanly unmounted, check forced.
Error reading block 128 (Input/output error).  Ignore error<y>? yes
Force rewrite<y>? yes
Actions #5

Updated by mkittler 2 months ago

  • Related to action #155737: Salt pillars pipelines fail due to refused connection errors on telegraf added
Actions #6

Updated by mkittler 2 months ago

  • Status changed from In Progress to Resolved

The worker is back to normal after recreating the filesystem (by rebooting it). I removed the silences.

We could switch to ext4 or another filesystem with journaling but I wouldn't make it part of this ticket.
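
For reference, the recreation that happens automatically on boot boils down to roughly the following (hypothetical sketch; device and mount unit names are taken from the fsck transcript above):

sudo systemctl stop var-lib-openqa.mount          # nothing may hold the filesystem
sudo mkfs.ext2 /dev/md127                         # recreate the ext2 filesystem, discarding its contents
sudo systemctl start var-lib-openqa.mount openqa-worker-cacheservice.service
# the cache service repopulates /var/lib/openqa/cache on demand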

Actions #7

Updated by okurz 2 months ago

  • Copied to action #155764: Consider switching to safer filesystems than ext2 in osd+o3 added
Actions #8

Updated by okurz 2 months ago

  • Status changed from Resolved to In Progress

https://suse.slack.com/archives/C02CANHLANP/p1708527510930799
https://openqa.suse.de/t13564994 failed iscsi_client
https://openqa.suse.de/t13565024 failed iscsi_client
https://openqa.suse.de/t13564932 failed iscsi_client
https://openqa.suse.de/t13564920 failed iscsi_client
https://openqa.suse.de/t13564968 failed iscsi_client
https://openqa.suse.de/t13564903 failed iscsi_client
https://openqa.suse.de/t13565058 failed iscsi_client
https://openqa.suse.de/t13564947 failed iscsi_client
https://openqa.suse.de/t13564820 failed iscsi_client
https://openqa.suse.de/t13564901 failed iscsi_client
https://openqa.suse.de/t13564873 failed iscsi_client
https://openqa.suse.de/t13564905 failed iscsi_client
https://openqa.suse.de/t13564936 failed iscsi_client


@Marius Kittler
could you be so kind as to follow http://open.qa/docs/#_debugging_open_vswitch_configuration and check the involved machines? I suspect that the reboot of worker29 triggered the problem.
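
For reference, the kind of checks I have in mind, roughly (generic Open vSwitch commands; the linked documentation has the full procedure):

# on worker29 and the other workers that ran jobs of the failed clusters
sudo ovs-vsctl show                    # br1 should list the gre* ports seen after boot
sudo ovs-vsctl list interface gre2     # per-tunnel state, remote_ip etc.
sudo systemctl status openvswitch os-autoinst-openvswitch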

Actions #9

Updated by mkittler 2 months ago · Edited

I've only just seen your comment. Considering that current jobs look good (e.g. https://openqa.suse.de/tests/13566201#dependencies, which ran on worker29 while other jobs in that cluster ran on other workers) I can save myself the effort.

Considering what we know about the MM setup at this point, the problematic behavior is unfortunately expected, although I would have expected that affected clusters are at least cancelled and end up incomplete/parallel_failed. So I checked the timestamps:

The server booted at Feb 21 13:42:48 worker29 kernel:…; that timestamp is from the journal and thus CET. The job started at 2024-02-21 13:40:55 and finished at 13:56:11; those timestamps are from the database and thus UTC. So the problem only became apparent around an hour after the reboot (and not immediately around the time when the reboot happened). Hence the clusters were not cancelled.
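
For clarity, converting the journal timestamp to UTC (a quick check with GNU date):

date -u -d '2024-02-21 13:42:48 CET' '+%F %T UTC'
# prints 2024-02-21 12:42:48 UTC, so the job start at 13:40:55 UTC is roughly 58 minutes after the reboot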

Looks like we really ran into some problems when rebooting:

Feb 21 13:43:24 worker29 (udev-worker)[2830]: gre_sys: Could not generate persistent MAC: No data available
Feb 21 13:43:24 worker29 ovs-vsctl[3244]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre2 -- set interface gre2 type=gre options:remote_ip=10.145.10.34
Feb 21 13:43:24 worker29 ovs-vsctl[3258]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre3 -- set interface gre3 type=gre options:remote_ip=10.145.10.3
Feb 21 13:43:24 worker29 ovs-vsctl[3264]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre4 -- set interface gre4 type=gre options:remote_ip=10.145.10.4
Feb 21 13:43:24 worker29 ovs-vsctl[3265]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre5 -- set interface gre5 type=gre options:remote_ip=10.145.10.5
Feb 21 13:43:24 worker29 ovs-vsctl[3266]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre6 -- set interface gre6 type=gre options:remote_ip=10.145.10.6
Feb 21 13:43:24 worker29 ovs-vsctl[3267]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre7 -- set interface gre7 type=gre options:remote_ip=10.145.10.7
Feb 21 13:43:24 worker29 ovs-vsctl[3268]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre8 -- set interface gre8 type=gre options:remote_ip=10.145.10.8
Feb 21 13:43:24 worker29 ovs-vsctl[3269]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre9 -- set interface gre9 type=gre options:remote_ip=10.145.10.9
Feb 21 13:43:24 worker29 ovs-vsctl[3270]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre10 -- set interface gre10 type=gre options:remote_ip=10.145.10.10
Feb 21 13:43:24 worker29 ovs-vsctl[3271]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre11 -- set interface gre11 type=gre options:remote_ip=10.145.10.11
Feb 21 13:43:24 worker29 ovs-vsctl[3272]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre12 -- set interface gre12 type=gre options:remote_ip=10.145.10.12
Feb 21 13:43:24 worker29 ovs-vsctl[3273]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre13 -- set interface gre13 type=gre options:remote_ip=10.145.10.13
Feb 21 13:43:25 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 21 13:43:26 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 21 13:43:26 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 21 13:43:27 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 21 13:43:27 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones

But at some point these messages stopped occurring.

At some point firewalld was restarted:

Feb 21 14:46:44 worker29 worker[4102]: [debug] [pid:4102] Upload concluded (at run)
Feb 21 14:46:45 worker29 systemd[1]: Started /usr/bin/systemctl restart firewalld.service.
Feb 21 14:46:45 worker29 systemd[1]: Stopping firewalld - dynamic firewall daemon...
Feb 21 14:46:46 worker29 worker[4013]: [debug] [pid:4013] Updating status so job 13564903 is not considered dead.
Feb 21 14:46:46 worker29 worker[4013]: [debug] [pid:4013] REST-API call: POST http://openqa.suse.de/api/v1/jobs/13564903/status
Feb 21 14:46:47 worker29 systemd[1]: firewalld.service: Deactivated successfully.
Feb 21 14:46:47 worker29 systemd[1]: Stopped firewalld - dynamic firewall daemon.
Feb 21 14:46:47 worker29 systemd[1]: Starting firewalld - dynamic firewall daemon...
Feb 21 14:46:48 worker29 worker[4057]: [debug] [pid:4057] REST-API call: POST http://openqa.suse.de/api/v1/jobs/13563666/status
Feb 21 14:46:48 worker29 worker[4057]: [debug] [pid:4057] Upload concluded (at run)
Feb 21 14:46:48 worker29 systemd[1]: Started firewalld - dynamic firewall daemon.
Feb 21 14:46:48 worker29 systemd[1]: run-r39edceb429c0449fb8508a6f378f5d5b.scope: Deactivated successfully.
Feb 21 14:46:48 worker29 worker[3977]: [debug] [pid:3977] REST-API call: POST http://openqa.suse.de/api/v1/jobs/13565154/status
Feb 21 14:46:48 worker29 worker[3977]: [debug] [pid:3977] Upload concluded (at bci_prepare)
Feb 21 14:46:48 worker29 worker[4084]: [debug] [pid:4084] REST-API call: POST http://openqa.suse.de/api/v1/jobs/13564888/status
Feb 21 14:46:48 worker29 worker[4084]: [debug] [pid:4084] Upload concluded (at boot_to_desktop)
Feb 21 14:46:48 worker29 worker[4069]: [debug] [pid:4069] REST-API call: POST http://openqa.suse.de/api/v1/jobs/13565196/status
Feb 21 14:46:48 worker29 worker[4069]: [debug] [pid:4069] Upload concluded (at boot_to_desktop)
Feb 21 14:46:48 worker29 worker[3973]: [debug] [pid:3973] REST-API call: POST http://openqa.suse.de/api/v1/jobs/13565074/status
Feb 21 14:46:48 worker29 worker[4150]: [debug] [pid:4150] REST-API call: POST http://openqa.suse.de/api/v1/jobs/13555124/status
Feb 21 14:46:48 worker29 worker[4150]: [debug] [pid:4150] Upload concluded (at generate_report)
Feb 21 14:46:49 worker29 ovs-vsctl[52518]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int br1 mtu_request=1460

This firewalld restart actually falls within the critical timeframe. The zypper logs don't contain anything in that timeframe (so the restart was probably not triggered by package updates).
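
One way to narrow down what issued that restart (a sketch; paths assume the default SUSE locations):

# journal entries around the restart mentioning firewalld or transient systemctl scopes
sudo journalctl --since '2024-02-21 14:46:00' --until '2024-02-21 14:47:00' | grep -iE 'firewalld|systemctl'
# package operations touching firewalld (nothing in the critical timeframe here)
grep -i firewalld /var/log/zypp/history | tail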

I would leave it at that for this ticket; I created #155848 for the firewalld-related problem (which, by the way, has not only been happening since I rebooted the machine).

Actions #10

Updated by openqa_review 2 months ago

  • Due date set to 2024-03-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by livdywan 2 months ago

  • Subject changed from [alert] openqa-worker-cacheservice fails to start on worker29.oqa.prg2.suse.org with "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" to [alert] openqa-worker-cacheservice fails to start on worker29.oqa.prg2.suse.org with "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S
  • Description updated (diff)
Actions #12

Updated by mkittler 2 months ago

  • Status changed from In Progress to Resolved

Since we have #155848 and I asked about restarting jobs in the chat (which can be done as part of #155848 if needed), I'm considering this resolved.
