action #155716

closed

[alert] openqa-worker-cacheservice fails to start on worker29.oqa.prg2.suse.org with "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S

Added by okurz 2 months ago. Updated 2 months ago.

Status: Resolved
Priority: Urgent
Assignee: mkittler
Category: Regressions/Crashes
Target version:
Start date: 2024-02-21
Due date: 2024-03-07
% Done: 0%
Estimated time:

Description

Observation

From https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

ssh worker29.oqa.prg2.suse.org "journalctl -u openqa-worker-cacheservice" says

Feb 21 09:25:43 worker29 openqa-workercache-daemon[86009]: [86009] [e] Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error at /u>
Feb 21 09:25:43 worker29 openqa-workercache-daemon[86009]: [86009] [e] Killing processes accessing the database file handles and removing database
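
To confirm the alert manually, the unit state can be checked the same way (plain systemd commands, nothing openQA-specific):

ssh worker29.oqa.prg2.suse.org "systemctl status openqa-worker-cacheservice"
ssh worker29.oqa.prg2.suse.org "systemctl is-failed openqa-worker-cacheservice"   # exits 0 while the unit is in the failed state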

Acceptance criteria

  • AC1: Cache service on worker29 works again

Suggestions

  • DONE Add silence(s)
  • Gather logs helpful for debugging, especially before the machine is rebooted (see the sketch below this list)
  • Maybe ext2 is just unreliable -> yes, it is. A reboot of the machine already fixed the problem because we recreate the filesystem automatically
  • Create another ticket for the related fallout of the reboot-triggered problem
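
A minimal sketch of what could be collected before a reboot discards the state (the command selection is an assumption, adjust paths as needed):

sudo journalctl -u openqa-worker-cacheservice --no-pager > /tmp/cacheservice.log   # full service log
sudo dmesg -T > /tmp/dmesg.log                                                     # kernel messages, look for I/O errors
cat /proc/mdstat > /tmp/mdstat.txt                                                 # state of the RAID device backing /var/lib/openqa
sudo smartctl -a /dev/nvme0n1 > /tmp/smart.txt    # repeat per RAID member; the device name here is only a placeholder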

Rollback actions

Out of scope

  • Using another filesystem

Related issues 2 (1 open, 1 closed)

Related to openQA Infrastructure - action #155737: Salt pillars pipelines fail due to refused connection errors on telegraf (status: Rejected, assignee: okurz)

Copied to openQA Infrastructure - action #155764: Consider switching to safer filesystems than ext2 in osd+o3 (status: New)
Actions #1

Updated by okurz 2 months ago

  • Description updated (diff)
Actions #2

Updated by okurz 2 months ago

  • Description updated (diff)
Actions #3

Updated by mkittler 2 months ago

  • Status changed from New to In Progress
  • Assignee set to mkittler
Actions #4

Updated by mkittler 2 months ago · Edited

Definitely not an issue with our code:

martchus@worker29:~> sudo -u _openqa-worker sqlite3 /var/lib/openqa/cache/cache.sqlite
SQLite version 3.44.0 2023-11-01 11:23:50
Enter ".help" for usage hints.            
sqlite> select * from assets;
Parse error: no such table: assets
sqlite> create table foo ( `bar` text );
Runtime error: disk I/O error (10)

Same on a completely fresh database file:

martchus@worker29:~> sudo -u _openqa-worker mv /var/lib/openqa/cache/cache.sqlite{,.bak}
martchus@worker29:~> sudo -u _openqa-worker sqlite3 /var/lib/openqa/cache/cache.sqlite
SQLite version 3.44.0 2023-11-01 11:23:50
Enter ".help" for usage hints.
sqlite> create table foo ( `bar` text );
Runtime error: disk I/O error (10)
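
Error code 10 is SQLITE_IOERR, i.e. the failure is reported by the layer below SQLite, which fits the next finding. An integrity check of the saved copy would point the same way (sketch, using the .bak file from above):

sudo -u _openqa-worker sqlite3 /var/lib/openqa/cache/cache.sqlite.bak 'PRAGMA integrity_check;'
# prints "ok" on a healthy database; here it is expected to fail with another disk I/O error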

Looks like the filesystem was broken:

martchus@worker29:~> sudo systemctl stop var-lib-openqa.mount 
martchus@worker29:~> sudo fsck.ext2 /dev/md127
e2fsck 1.46.4 (18-Aug-2021)
/dev/md127 was not cleanly unmounted, check forced.
Error reading block 128 (Input/output error).  Ignore error<y>? yes
Force rewrite<y>? yes
Actions #5

Updated by mkittler 2 months ago

  • Related to action #155737: Salt pillars pipelines fail due to refused connection errors on telegraf added
Actions #6

Updated by mkittler 2 months ago

  • Status changed from In Progress to Resolved

The worker is back to normal after recreating the filesystem (by rebooting it). I removed the silences.

We could switch to ext4 or another filesystem with journaling but I wouldn't make it part of this ticket.
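
For reference, the recreation that happens automatically on boot boils down to roughly the following (hypothetical sketch; device and mount unit names are taken from the fsck transcript above):

sudo systemctl stop var-lib-openqa.mount          # nothing may hold the filesystem
sudo mkfs.ext2 /dev/md127                         # recreate the ext2 filesystem, discarding its contents
sudo systemctl start var-lib-openqa.mount openqa-worker-cacheservice.service
# the cache service repopulates /var/lib/openqa/cache on demand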

Actions #7

Updated by okurz 2 months ago

  • Copied to action #155764: Consider switching to safer filesystems than ext2 in osd+o3 added
Actions #8

Updated by okurz 2 months ago

  • Status changed from Resolved to In Progress

https://suse.slack.com/archives/C02CANHLANP/p1708527510930799
https://openqa.suse.de/t13564994 failed iscsi_client
https://openqa.suse.de/t13565024 failed iscsi_client
https://openqa.suse.de/t13564932 failed iscsi_client
https://openqa.suse.de/t13564920 failed iscsi_client
https://openqa.suse.de/t13564968 failed iscsi_client
https://openqa.suse.de/t13564903 failed iscsi_client
https://openqa.suse.de/t13565058 failed iscsi_client
https://openqa.suse.de/t13564947 failed iscsi_client
https://openqa.suse.de/t13564820 failed iscsi_client
https://openqa.suse.de/t13564901 failed iscsi_client
https://openqa.suse.de/t13564873 failed iscsi_client
https://openqa.suse.de/t13564905 failed iscsi_client
https://openqa.suse.de/t13564936 failed iscsi_client


@Marius Kittler
could you be so kind as to follow http://open.qa/docs/#_debugging_open_vswitch_configuration and check the involved machines? I suspect that the reboot of worker29 triggered the problem.
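
For reference, the kind of checks I have in mind, roughly (generic Open vSwitch commands; the linked documentation has the full procedure):

# on worker29 and the other workers that ran jobs of the failed clusters
sudo ovs-vsctl show                    # br1 should list the gre* ports seen after boot
sudo ovs-vsctl list interface gre2     # per-tunnel state, remote_ip etc.
sudo systemctl status openvswitch os-autoinst-openvswitch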

Actions #9

Updated by mkittler 2 months ago · Edited

I've only just seen your comment. Considering that current jobs look good (e.g. https://openqa.suse.de/tests/13566201#dependencies, which ran on worker29 while other jobs in that cluster ran on other workers) I can save myself the effort.

Considering what we know about the MM setup at this point, the problematic behavior is unfortunately expected, although I would have expected that affected clusters are at least cancelled and end up incomplete/parallel_failed. So I checked the timestamps:

The server booted at Feb 21 13:42:48 worker29 kernel:…; that timestamp is from the journal and thus CET. The job started at 2024-02-21 13:40:55 and finished at 13:56:11; those timestamps are from the database and thus UTC. So the problem only became apparent around an hour after the reboot (and not immediately around the time when the reboot happened). Hence the clusters were not cancelled.
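
For clarity, converting the journal timestamp to UTC (a quick check with GNU date):

date -u -d '2024-02-21 13:42:48 CET' '+%F %T UTC'
# prints 2024-02-21 12:42:48 UTC, so the job start at 13:40:55 UTC is roughly 58 minutes after the reboot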

Looks like we really ran into some problems when rebooting:

Feb 21 13:43:24 worker29 (udev-worker)[2830]: gre_sys: Could not generate persistent MAC: No data available
Feb 21 13:43:24 worker29 ovs-vsctl[3244]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre2 -- set interface gre2 type=gre options:remote_ip=10.145.10.34
Feb 21 13:43:24 worker29 ovs-vsctl[3258]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre3 -- set interface gre3 type=gre options:remote_ip=10.145.10.3
Feb 21 13:43:24 worker29 ovs-vsctl[3264]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre4 -- set interface gre4 type=gre options:remote_ip=10.145.10.4
Feb 21 13:43:24 worker29 ovs-vsctl[3265]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre5 -- set interface gre5 type=gre options:remote_ip=10.145.10.5
Feb 21 13:43:24 worker29 ovs-vsctl[3266]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre6 -- set interface gre6 type=gre options:remote_ip=10.145.10.6
Feb 21 13:43:24 worker29 ovs-vsctl[3267]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre7 -- set interface gre7 type=gre options:remote_ip=10.145.10.7
Feb 21 13:43:24 worker29 ovs-vsctl[3268]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre8 -- set interface gre8 type=gre options:remote_ip=10.145.10.8
Feb 21 13:43:24 worker29 ovs-vsctl[3269]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre9 -- set interface gre9 type=gre options:remote_ip=10.145.10.9
Feb 21 13:43:24 worker29 ovs-vsctl[3270]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre10 -- set interface gre10 type=gre options:remote_ip=10.145.10.10
Feb 21 13:43:24 worker29 ovs-vsctl[3271]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre11 -- set interface gre11 type=gre options:remote_ip=10.145.10.11
Feb 21 13:43:24 worker29 ovs-vsctl[3272]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre12 -- set interface gre12 type=gre options:remote_ip=10.145.10.12
Feb 21 13:43:24 worker29 ovs-vsctl[3273]: ovs|00001|vsctl|INFO|Called as ovs-vsctl --may-exist add-port br1 gre13 -- set interface gre13 type=gre options:remote_ip=10.145.10.13
Feb 21 13:43:25 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 21 13:43:26 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 21 13:43:26 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 21 13:43:27 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones
Feb 21 13:43:27 worker29 firewalld[2198]: ERROR: Calling pre func <bound method Firewall.full_check_config of <class 'firewall.core.fw.Firewall'>(True, True, True, 'RUNNING', False, 'trusted', {}, [], True, True, True, False, 'off')>(()) failed: INVALID_ZONE: 'libvirt-routed' not among existing zones

But at some point these messages stopped occurring.

At some point firewalld was restarted:

Feb 21 14:46:44 worker29 worker[4102]: [debug] [pid:4102] Upload concluded (at run)
Feb 21 14:46:45 worker29 systemd[1]: Started /usr/bin/systemctl restart firewalld.service.
Feb 21 14:46:45 worker29 systemd[1]: Stopping firewalld - dynamic firewall daemon...
Feb 21 14:46:46 worker29 worker[4013]: [debug] [pid:4013] Updating status so job 13564903 is not considered dead.
Feb 21 14:46:46 worker29 worker[4013]: [debug] [pid:4013] REST-API call: POST http://openqa.suse.de/api/v1/jobs/13564903/status
Feb 21 14:46:47 worker29 systemd[1]: firewalld.service: Deactivated successfully.
Feb 21 14:46:47 worker29 systemd[1]: Stopped firewalld - dynamic firewall daemon.
Feb 21 14:46:47 worker29 systemd[1]: Starting firewalld - dynamic firewall daemon...
Feb 21 14:46:48 worker29 worker[4057]: [debug] [pid:4057] REST-API call: POST http://openqa.suse.de/api/v1/jobs/13563666/status
Feb 21 14:46:48 worker29 worker[4057]: [debug] [pid:4057] Upload concluded (at run)
Feb 21 14:46:48 worker29 systemd[1]: Started firewalld - dynamic firewall daemon.
Feb 21 14:46:48 worker29 systemd[1]: run-r39edceb429c0449fb8508a6f378f5d5b.scope: Deactivated successfully.
Feb 21 14:46:48 worker29 worker[3977]: [debug] [pid:3977] REST-API call: POST http://openqa.suse.de/api/v1/jobs/13565154/status
Feb 21 14:46:48 worker29 worker[3977]: [debug] [pid:3977] Upload concluded (at bci_prepare)
Feb 21 14:46:48 worker29 worker[4084]: [debug] [pid:4084] REST-API call: POST http://openqa.suse.de/api/v1/jobs/13564888/status
Feb 21 14:46:48 worker29 worker[4084]: [debug] [pid:4084] Upload concluded (at boot_to_desktop)
Feb 21 14:46:48 worker29 worker[4069]: [debug] [pid:4069] REST-API call: POST http://openqa.suse.de/api/v1/jobs/13565196/status
Feb 21 14:46:48 worker29 worker[4069]: [debug] [pid:4069] Upload concluded (at boot_to_desktop)
Feb 21 14:46:48 worker29 worker[3973]: [debug] [pid:3973] REST-API call: POST http://openqa.suse.de/api/v1/jobs/13565074/status
Feb 21 14:46:48 worker29 worker[4150]: [debug] [pid:4150] REST-API call: POST http://openqa.suse.de/api/v1/jobs/13555124/status
Feb 21 14:46:48 worker29 worker[4150]: [debug] [pid:4150] Upload concluded (at generate_report)
Feb 21 14:46:49 worker29 ovs-vsctl[52518]: ovs|00001|vsctl|INFO|Called as ovs-vsctl set int br1 mtu_request=1460

This firewalld restart actually falls within the critical timeframe. The zypper logs don't contain anything in that timeframe (so the restart was probably not triggered by package updates).
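
One way to narrow down what issued that restart (a sketch; paths assume the default SUSE locations):

# journal entries around the restart mentioning firewalld or transient systemctl scopes
sudo journalctl --since '2024-02-21 14:46:00' --until '2024-02-21 14:47:00' | grep -iE 'firewalld|systemctl'
# package operations touching firewalld (nothing in the critical timeframe here)
grep -i firewalld /var/log/zypp/history | tail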

I would leave it at that for this ticket; I created #155848 for the firewalld-related problem (which, by the way, has not only been happening since I rebooted the machine).

Actions #10

Updated by openqa_review 2 months ago

  • Due date set to 2024-03-07

Setting due date based on mean cycle time of SUSE QE Tools

Actions #11

Updated by livdywan 2 months ago

  • Subject changed from [alert] openqa-worker-cacheservice fails to start on worker29.oqa.prg2.suse.org with "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" to [alert] openqa-worker-cacheservice fails to start on worker29.oqa.prg2.suse.org with "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S
  • Description updated (diff)
Actions #12

Updated by mkittler 2 months ago

  • Status changed from In Progress to Resolved

Since we have #155848 and I asked about restarting jobs in the chat (which can be done as part of #155848 if needed), I'm considering this resolved.
