action #122998

closed

o3 worker rebel is down; was: inconsistent package database or filesystem corruption size:M

Added by okurz almost 2 years ago. Updated almost 2 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Start date: 2023-01-12
Due date:
% Done: 0%
Estimated time:
Tags:

Description

Observation

Note: The worker is currently completely offline and needs to be recovered first.

Tina Müller tried to install a Perl package on the o3 workers and reported "file conflicts":

on rebel I tried to install perl-IO-Compress-Lzma and got:

Loading repository data...        
Reading installed packages...                        
Resolving package dependencies...                                    

The following 8 NEW packages are going to be installed:           
  openQA-client openQA-common openQA-worker perl-Compress-Raw-Bzip2 perl-Compress-Raw-Lzma perl-Compress-Raw-Zlib perl-IO-Compress perl-IO-Compress-Lzma

so it tried to install much more than requested, and then got conflicts like:

Detected 11 file conflicts:                                          

File /usr/share/openqa/lib/OpenQA/CacheService/Client.pm   
  from install of                                                    
     openQA-worker-4.6.1673443568.582a087-lp154.5532.1.noarch (devel_openQA)
  conflicts with file from package                               
     perl-Net-SSH2-0.73-lp153.2.2.x86_64 (@System)               

File /usr/share/openqa/lib/OpenQA/CacheService/Controller/API.pm                                                                           
  from install of                                       
     openQA-worker-4.6.1673443568.582a087-lp154.5532.1.noarch (devel_openQA)                                                                                                                                                                                                           
  conflicts with file from package                                                                                                                                                                                                                                                     
     perl-Net-SSH2-0.73-lp153.2.2.x86_64 (@System)                      

looks like some corruption to me.
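
To make the pattern in the conflict report easier to see, the following is a minimal sketch (not part of the original report) that parses zypper's "file conflicts" output and groups the conflicting paths by the installed package that supposedly owns them. The function names and the `SAMPLE` constant are hypothetical; the sample text is copied from the output above with trailing whitespace removed. If every conflicting path under /usr/share/openqa is attributed to an unrelated package such as perl-Net-SSH2, that would be consistent with the suspicion of an inconsistent package database.

```python
from collections import defaultdict

# Two of the conflict entries quoted in the ticket (trailing whitespace removed).
SAMPLE = """\
File /usr/share/openqa/lib/OpenQA/CacheService/Client.pm
  from install of
     openQA-worker-4.6.1673443568.582a087-lp154.5532.1.noarch (devel_openQA)
  conflicts with file from package
     perl-Net-SSH2-0.73-lp153.2.2.x86_64 (@System)

File /usr/share/openqa/lib/OpenQA/CacheService/Controller/API.pm
  from install of
     openQA-worker-4.6.1673443568.582a087-lp154.5532.1.noarch (devel_openQA)
  conflicts with file from package
     perl-Net-SSH2-0.73-lp153.2.2.x86_64 (@System)
"""


def parse_file_conflicts(text):
    """Return a list of (path, new_package, installed_package) tuples."""
    conflicts = []
    path = new_pkg = None
    expect = None  # which kind of package line should come next
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("File "):
            path, new_pkg, expect = line[len("File "):], None, None
        elif line == "from install of":
            expect = "new"
        elif line == "conflicts with file from package":
            expect = "installed"
        elif line and expect == "new":
            new_pkg, expect = line.split()[0], None
        elif line and expect == "installed":
            conflicts.append((path, new_pkg, line.split()[0]))
            expect = None
    return conflicts


def group_by_installed_package(conflicts):
    """Map each installed package to the list of paths it conflicts on."""
    grouped = defaultdict(list)
    for path, _new_pkg, installed_pkg in conflicts:
        grouped[installed_pkg].append(path)
    return dict(grouped)


if __name__ == "__main__":
    for pkg, paths in group_by_installed_package(parse_file_conflicts(SAMPLE)).items():
        print(pkg)
        for p in paths:
            print("   ", p)
```

The parser only understands the layout shown above and would need adjusting for other zypper output formats.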

Acceptance criteria

  • AC1: Machine is online
  • AC2: No more package conflicts

Suggestions

  • IPMI reboot
  • Use the old IPMI command; the machine isn't in the new security zone yet
  • May need to reinstall the machine
Actions #1

Updated by okurz almost 2 years ago

  • Status changed from In Progress to New
  • Assignee deleted (okurz)

Calls like df -h take very long to execute, so I suspect a deeper problem. The system has been up for 37 days, so let's try a reboot. I triggered a reboot and, not entirely unexpectedly, the machine did not come back up afterwards, so that's the next thing to check.
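
One way to quantify "very long" for calls such as df -h is to time the command with an upper bound. The following is a minimal sketch, not something that was run as part of this ticket; `time_command` and its `timeout_s` parameter are hypothetical names.

```python
import subprocess
import time


def time_command(argv, timeout_s=30.0):
    """Run argv and return (elapsed_seconds, timed_out).

    Output is discarded; only the wall-clock duration is of interest.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising this exception.
        timed_out = True
    return time.monotonic() - start, timed_out
```

For example, `time_command(["df", "-h"], timeout_s=10.0)` would report how long df took, or that it exceeded ten seconds. One caveat: a process blocked in uninterruptible I/O cannot be killed, so on a machine with a hanging disk the call itself may block for longer than the timeout.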

Actions #2

Updated by livdywan almost 2 years ago

  • Subject changed from o3 worker rebel seems to have inconsistent package database or filesystem corruption to o3 worker rebel is down; was: inconsistent package database or filesystem corruption size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by tinita almost 2 years ago

  • Description updated (diff)
Actions #4

Updated by mkittler almost 2 years ago

The machine was sitting in the rescue shell and booted after pressing Ctrl+D. The journal (journalctl) shows the following relevant output:

Jan 12 09:45:23 localhost systemd[1]: Starting Apply Kernel Variables...
Jan 12 09:45:23 localhost systemd[1]: Finished Apply Kernel Variables.
Jan 12 09:46:53 localhost systemd[1]: dev-disk-by\x2duuid-5d64b703\x2d4230\x2d4e81\x2d8264\x2d4bb436fd8546.device: Job dev-disk-by\x2duuid-5d64b703\x2d4230>
Jan 12 09:46:53 localhost systemd[1]: Timed out waiting for device /dev/disk/by-uuid/5d64b703-4230-4e81-8264-4bb436fd8546.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /root.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for Local File Systems.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for Early Kernel Boot Messages.
Jan 12 09:46:53 localhost systemd[1]: klog.service: Job klog.service/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: local-fs.target: Triggering OnFailure= dependencies.
Jan 12 09:46:53 localhost systemd[1]: root.mount: Job root.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /home.
Jan 12 09:46:53 localhost systemd[1]: home.mount: Job home.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /boot/grub2/x86_64-efi.
Jan 12 09:46:53 localhost systemd[1]: boot-grub2-x86_64\x2defi.mount: Job boot-grub2-x86_64\x2defi.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /tmp.
Jan 12 09:46:53 localhost systemd[1]: tmp.mount: Job tmp.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /usr/local.
Jan 12 09:46:53 localhost systemd[1]: usr-local.mount: Job usr-local.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /.snapshots.
Jan 12 09:46:53 localhost systemd[1]: \x2esnapshots.mount: Job \x2esnapshots.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /srv.
Jan 12 09:46:53 localhost systemd[1]: srv.mount: Job srv.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /boot/grub2/i386-pc.
Jan 12 09:46:53 localhost systemd[1]: boot-grub2-i386\x2dpc.mount: Job boot-grub2-i386\x2dpc.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /opt.
Jan 12 09:46:53 localhost systemd[1]: opt.mount: Job opt.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: dev-disk-by\x2duuid-5d64b703\x2d4230\x2d4e81\x2d8264\x2d4bb436fd8546.device: Job dev-disk-by\x2duuid-5d64b703\x2d4230>
Jan 12 09:46:53 localhost systemd[1]: Unnecessary job was removed for /dev/ttyS2.
Jan 12 09:46:53 localhost systemd[1]: dev-disk-by\x2duuid-bbfe15a0\x2d3906\x2d4b51\x2da0bd\x2df1c885520b04.device: Job dev-disk-by\x2duuid-bbfe15a0\x2d3906>
Jan 12 09:46:53 localhost systemd[1]: Timed out waiting for device /dev/disk/by-uuid/bbfe15a0-3906-4b51-a0bd-f1c885520b04.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /dev/disk/by-uuid/bbfe15a0-3906-4b51-a0bd-f1c885520b04.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for Swaps.
Jan 12 09:46:53 localhost systemd[1]: swap.target: Job swap.target/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: dev-disk-by\x2duuid-bbfe15a0\x2d3906\x2d4b51\x2da0bd\x2df1c885520b04.swap: Job dev-disk-by\x2duuid-bbfe15a0\x2d3906\x>
Jan 12 09:46:53 localhost systemd[1]: dev-disk-by\x2duuid-bbfe15a0\x2d3906\x2d4b51\x2da0bd\x2df1c885520b04.device: Job dev-disk-by\x2duuid-bbfe15a0\x2d3906>
Jan 12 09:46:53 localhost systemd[1]: Reached target Timer Units.
Jan 12 09:46:53 localhost systemd[1]: Starting Restore /run/initramfs on shutdown...
Jan 12 09:46:53 localhost systemd[1]: Reached target Host and Network Name Lookups.
Jan 12 09:46:53 localhost systemd[1]: Reached target System Time Synchronized.
Jan 12 09:46:53 localhost systemd[1]: syslog.socket: Deactivated successfully.
Jan 12 09:46:53 localhost systemd[1]: Closed Syslog Socket.
Jan 12 09:46:53 localhost systemd[1]: Reached target Login Prompts.
Jan 12 09:46:53 localhost systemd[1]: Reached target User and Group Name Lookups.
Jan 12 09:46:53 localhost systemd[1]: Reached target Network.
Jan 12 09:46:53 localhost systemd[1]: Reached target Network is Online.
Jan 12 09:46:53 localhost systemd[1]: Reached target Path Units.
Jan 12 09:46:53 localhost systemd[1]: Reached target Socket Units.
Jan 12 09:46:53 localhost systemd[1]: Started Emergency Shell.
Jan 12 09:46:53 localhost systemd[1]: Reached target Emergency Mode.
Jan 12 09:46:53 localhost systemd[1]: Starting Tell Plymouth To Write Out Runtime Data...
Jan 12 09:46:53 localhost systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
Jan 12 09:46:53 localhost systemd[1]: Finished Restore /run/initramfs on shutdown.
Jan 12 09:46:53 localhost systemd[1]: Received SIGRTMIN+20 from PID 510 (plymouthd).
Jan 12 09:46:53 localhost systemd[1]: Finished Tell Plymouth To Write Out Runtime Data.
Jan 12 09:47:23 localhost udevadm[728]: Failed to wait for daemon to reply: Connection timed out
Jan 12 09:47:23 localhost systemd[1]: systemd-udev-settle.service: Main process exited, code=exited, status=1/FAILURE
Jan 12 09:47:23 localhost systemd[1]: systemd-udev-settle.service: Failed with result 'exit-code'.
Jan 12 09:47:23 localhost systemd[1]: Failed to start Wait for udev To Complete Device Initialization.
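
The log above is long, so here is a minimal sketch (an illustration, not a tool used in this ticket) that reduces such journal output to its two key facts: which devices systemd timed out waiting for, and which units failed as a consequence. The function name, the regular expressions and the `SAMPLE` constant are assumptions made for this example; the sample lines are copied verbatim from the output above.

```python
import re

# Patterns for the two systemd messages of interest in the journal excerpt.
TIMEOUT_RE = re.compile(r"Timed out waiting for device (\S+?)\.$")
DEP_FAILED_RE = re.compile(r"Dependency failed for (.+?)\.$")

SAMPLE = """\
Jan 12 09:46:53 localhost systemd[1]: Timed out waiting for device /dev/disk/by-uuid/5d64b703-4230-4e81-8264-4bb436fd8546.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /root.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for Local File Systems.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /.snapshots.
Jan 12 09:46:53 localhost systemd[1]: Timed out waiting for device /dev/disk/by-uuid/bbfe15a0-3906-4b51-a0bd-f1c885520b04.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for Swaps.
"""


def summarize_boot_failures(journal_text):
    """Return (timed_out_devices, failed_dependencies) in order of appearance."""
    timed_out, dep_failed = [], []
    for line in journal_text.splitlines():
        line = line.rstrip()
        match = TIMEOUT_RE.search(line)
        if match:
            timed_out.append(match.group(1))
            continue
        match = DEP_FAILED_RE.search(line)
        if match:
            dep_failed.append(match.group(1))
    return timed_out, dep_failed


if __name__ == "__main__":
    devices, failed = summarize_boot_failures(SAMPLE)
    print("Timed out devices:", devices)
    print("Failed dependencies:", failed)
```

Applied to the full excerpt, this would list two by-uuid devices (one backing the file system mounts, one backing swap), which matches the reading that the disk did not show up in time during boot.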
Actions #5

Updated by mkittler almost 2 years ago

  • Status changed from Workable to Feedback
  • Assignee set to mkittler

It works again after rebooting. Rebooting another time worked as well. I'm not exactly sure what the problem was. Maybe /dev/sda was really just too slow, causing the timeout which led to the system being stuck in the rescue shell. SMART doesn't show any problems with the disk.

I've also updated the system (to get the latest dependencies like perl-SemVer) and enabled s390x tests (as a temporary replacement for the setup on openqaworker1), and that seems to work in general.

Actions #6

Updated by mkittler almost 2 years ago

  • Status changed from Feedback to Resolved

It is still up and running. I suppose the setup can be considered stable enough (and we can always reopen the ticket). Rollback steps for the s390x setup have been added to the openqaworker1 ticket.
