action #155716

closed

[alert] openqa-worker-cacheservice fails to start on worker29.oqa.prg2.suse.org with "Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error" size:S

Added by okurz 3 months ago. Updated 3 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-02-21
Due date: 2024-03-07
% Done: 0%
Estimated time:

Description

Observation

From https://monitor.qa.suse.de/d/KToPYLEWz/failed-systemd-services?orgId=1

ssh worker29.oqa.prg2.suse.org "journalctl -u openqa-worker-cacheservice" says

Feb 21 09:25:43 worker29 openqa-workercache-daemon[86009]: [86009] [e] Database has been corrupted: DBD::SQLite::db commit failed: disk I/O error at /u>
Feb 21 09:25:43 worker29 openqa-workercache-daemon[86009]: [86009] [e] Killing processes accessing the database file handles and removing database

Acceptance criteria

  • AC1: Cache service on worker29 works again

Suggestions

  • DONE Add silence(s)
  • Gather logs helpful for debugging especially before the machine is rebooted
  • Maybe ext2 is just unreliable -> yes, it is. A reboot of the machine already fixed the problem because we recreate the filesystem automatically
  • Create another ticket for the related fallout of the reboot-triggered problem
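
For the "gather logs" suggestion, a minimal sketch of what could be collected before a reboot. `collect_logs` and the output file names are hypothetical and not part of openQA; only the unit name and host come from this ticket. The `journalctl` and `systemctl` flags used are standard ones. Setting `SSH=echo` gives a dry run that only records the commands.

```shell
# Sketch only: collect_logs is a hypothetical helper, not an openQA tool.
# SSH can be overridden, e.g. SSH=echo for a dry run.
SSH="${SSH:-ssh}"

collect_logs() {
    host="$1"
    outdir="${2:-.}"
    # Log of the failing unit, current boot only
    $SSH "$host" "journalctl -b --no-pager -u openqa-worker-cacheservice" \
        > "$outdir/cacheservice.log"
    # Kernel messages of the current boot, where disk I/O errors would show up
    $SSH "$host" "journalctl -k -b --no-pager" > "$outdir/kernel.log"
    # Current unit state; systemctl exits non-zero for a failed unit
    $SSH "$host" "systemctl status --no-pager openqa-worker-cacheservice" \
        > "$outdir/status.log"
    return 0
}
```

Usage would be along the lines of `collect_logs worker29.oqa.prg2.suse.org /tmp/poo155716`, with the target directory created beforehand. Note that `-b` limits output to the current boot, so this only helps when run before the reboot, unless the journal is persistent and `-b -1` is used afterwards.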

Rollback actions

Out of scope

  • Using another filesystem

Related issues 2 (1 open, 1 closed)

Related to openQA Infrastructure - action #155737: Salt pillars pipelines fail due to refused connection errors on telegraf (Rejected, okurz)
Copied to openQA Infrastructure - action #155764: Consider switching to safer filesystems than ext2 in osd+o3 (New)
