action #122998
closed
o3 worker rebel is down; was: inconsistent package database or filesystem corruption size:M
0%
Description
Observation
Note: The worker is currently completely offline and needs to be recovered first.
Tina Müller tried to install a Perl package on the o3 workers and reported "file conflicts":
on rebel I tried to install perl-IO-Compress-Lzma and got:
Loading repository data...
Reading installed packages...
Resolving package dependencies...
The following 8 NEW packages are going to be installed:
openQA-client openQA-common openQA-worker perl-Compress-Raw-Bzip2 perl-Compress-Raw-Lzma perl-Compress-Raw-Zlib perl-IO-Compress perl-IO-Compress-Lzma
so it tried to install much more than requested, and then got conflicts like:
Detected 11 file conflicts:
File /usr/share/openqa/lib/OpenQA/CacheService/Client.pm
from install of
openQA-worker-4.6.1673443568.582a087-lp154.5532.1.noarch (devel_openQA)
conflicts with file from package
perl-Net-SSH2-0.73-lp153.2.2.x86_64 (@System)
File /usr/share/openqa/lib/OpenQA/CacheService/Controller/API.pm
from install of
openQA-worker-4.6.1673443568.582a087-lp154.5532.1.noarch (devel_openQA)
conflicts with file from package
perl-Net-SSH2-0.73-lp153.2.2.x86_64 (@System)
looks like some corruption to me.
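To check whether all reported conflicts point at the same installed package, the zypper output can be grouped by owner. This is a minimal sketch that assumes the conflict layout quoted above; the function name `group_conflicts` and the `SAMPLE` constant are hypothetical and only for illustration.

```python
from collections import defaultdict

# Excerpt of the conflict report quoted in this ticket.
SAMPLE = """\
Detected 11 file conflicts:
File /usr/share/openqa/lib/OpenQA/CacheService/Client.pm
from install of
openQA-worker-4.6.1673443568.582a087-lp154.5532.1.noarch (devel_openQA)
conflicts with file from package
perl-Net-SSH2-0.73-lp153.2.2.x86_64 (@System)
File /usr/share/openqa/lib/OpenQA/CacheService/Controller/API.pm
from install of
openQA-worker-4.6.1673443568.582a087-lp154.5532.1.noarch (devel_openQA)
conflicts with file from package
perl-Net-SSH2-0.73-lp153.2.2.x86_64 (@System)
"""


def group_conflicts(zypper_output):
    """Map each already-installed package to the files it reportedly owns."""
    owners = defaultdict(list)
    current_file = None
    expect_owner = False
    for raw in zypper_output.splitlines():
        line = raw.strip()
        if line.startswith("File "):
            current_file = line[len("File "):]
            expect_owner = False
        elif line == "conflicts with file from package":
            # The next non-empty line names the installed package.
            expect_owner = True
        elif line and expect_owner and current_file:
            # Drop the trailing repository tag, e.g. " (@System)".
            owners[line.split(" (")[0]].append(current_file)
            current_file = None
            expect_owner = False
    return dict(owners)


for package, files in group_conflicts(SAMPLE).items():
    print(package, len(files))
```

If every conflict names the same unrelated package, as with perl-Net-SSH2 in the excerpt, that would fit a damaged package database better than a genuine packaging conflict, though the grouping alone does not prove it.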
Acceptance criteria
- AC1: Machine is online
- AC2: No more package conflicts
Suggestions
- IPMI reboot
- Use the old IPMI command; the machine isn't in the new security zone yet
- May need to reinstall the machine
Updated by okurz almost 2 years ago
- Status changed from In Progress to New
- Assignee deleted (okurz)
Calls like df -h take very long to execute, so I suspect a deeper problem. The system has been up for 37 days, so let's try a reboot. I triggered a reboot and now, not very unexpectedly, the machine does not come up again, so that's the next thing to check.
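A slow command such as df -h can be measured with a time limit so that the check itself does not block indefinitely. This is a minimal sketch; the helper name `timed_run` and the 10 second default limit are hypothetical choices, not something used on the worker.

```python
import subprocess
import time


def timed_run(cmd, limit_s=10.0):
    """Run cmd and return (elapsed_seconds, finished).

    finished is False if the command was killed after limit_s seconds.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=limit_s,
            check=False,
        )
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return time.monotonic() - start, finished
```

For example, `timed_run(["df", "-h"])` returns how long the call took and whether it completed within the limit. Note that a process blocked in uninterruptible I/O may not react to being killed, so on a badly stuck disk even this wrapper can hang.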
Updated by livdywan almost 2 years ago
- Subject changed from o3 worker rebel seems to have inconsistent package database or filesystem corruption to o3 worker rebel is down; was: inconsistent package database or filesystem corruption size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by mkittler almost 2 years ago
It was in the rescue shell and booted after pressing Ctrl + D. Journalctl shows the following relevant output:
Jan 12 09:45:23 localhost systemd[1]: Starting Apply Kernel Variables...
Jan 12 09:45:23 localhost systemd[1]: Finished Apply Kernel Variables.
Jan 12 09:46:53 localhost systemd[1]: dev-disk-by\x2duuid-5d64b703\x2d4230\x2d4e81\x2d8264\x2d4bb436fd8546.device: Job dev-disk-by\x2duuid-5d64b703\x2d4230>
Jan 12 09:46:53 localhost systemd[1]: Timed out waiting for device /dev/disk/by-uuid/5d64b703-4230-4e81-8264-4bb436fd8546.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /root.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for Local File Systems.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for Early Kernel Boot Messages.
Jan 12 09:46:53 localhost systemd[1]: klog.service: Job klog.service/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: local-fs.target: Job local-fs.target/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: local-fs.target: Triggering OnFailure= dependencies.
Jan 12 09:46:53 localhost systemd[1]: root.mount: Job root.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /home.
Jan 12 09:46:53 localhost systemd[1]: home.mount: Job home.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /boot/grub2/x86_64-efi.
Jan 12 09:46:53 localhost systemd[1]: boot-grub2-x86_64\x2defi.mount: Job boot-grub2-x86_64\x2defi.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /tmp.
Jan 12 09:46:53 localhost systemd[1]: tmp.mount: Job tmp.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /usr/local.
Jan 12 09:46:53 localhost systemd[1]: usr-local.mount: Job usr-local.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /.snapshots.
Jan 12 09:46:53 localhost systemd[1]: \x2esnapshots.mount: Job \x2esnapshots.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /srv.
Jan 12 09:46:53 localhost systemd[1]: srv.mount: Job srv.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /boot/grub2/i386-pc.
Jan 12 09:46:53 localhost systemd[1]: boot-grub2-i386\x2dpc.mount: Job boot-grub2-i386\x2dpc.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /opt.
Jan 12 09:46:53 localhost systemd[1]: opt.mount: Job opt.mount/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: dev-disk-by\x2duuid-5d64b703\x2d4230\x2d4e81\x2d8264\x2d4bb436fd8546.device: Job dev-disk-by\x2duuid-5d64b703\x2d4230>
Jan 12 09:46:53 localhost systemd[1]: Unnecessary job was removed for /dev/ttyS2.
Jan 12 09:46:53 localhost systemd[1]: dev-disk-by\x2duuid-bbfe15a0\x2d3906\x2d4b51\x2da0bd\x2df1c885520b04.device: Job dev-disk-by\x2duuid-bbfe15a0\x2d3906>
Jan 12 09:46:53 localhost systemd[1]: Timed out waiting for device /dev/disk/by-uuid/bbfe15a0-3906-4b51-a0bd-f1c885520b04.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /dev/disk/by-uuid/bbfe15a0-3906-4b51-a0bd-f1c885520b04.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for Swaps.
Jan 12 09:46:53 localhost systemd[1]: swap.target: Job swap.target/start failed with result 'dependency'.
Jan 12 09:46:53 localhost systemd[1]: dev-disk-by\x2duuid-bbfe15a0\x2d3906\x2d4b51\x2da0bd\x2df1c885520b04.swap: Job dev-disk-by\x2duuid-bbfe15a0\x2d3906\x>
Jan 12 09:46:53 localhost systemd[1]: dev-disk-by\x2duuid-bbfe15a0\x2d3906\x2d4b51\x2da0bd\x2df1c885520b04.device: Job dev-disk-by\x2duuid-bbfe15a0\x2d3906>
Jan 12 09:46:53 localhost systemd[1]: Reached target Timer Units.
Jan 12 09:46:53 localhost systemd[1]: Starting Restore /run/initramfs on shutdown...
Jan 12 09:46:53 localhost systemd[1]: Reached target Host and Network Name Lookups.
Jan 12 09:46:53 localhost systemd[1]: Reached target System Time Synchronized.
Jan 12 09:46:53 localhost systemd[1]: syslog.socket: Deactivated successfully.
Jan 12 09:46:53 localhost systemd[1]: Closed Syslog Socket.
Jan 12 09:46:53 localhost systemd[1]: Reached target Login Prompts.
Jan 12 09:46:53 localhost systemd[1]: Reached target User and Group Name Lookups.
Jan 12 09:46:53 localhost systemd[1]: Reached target Network.
Jan 12 09:46:53 localhost systemd[1]: Reached target Network is Online.
Jan 12 09:46:53 localhost systemd[1]: Reached target Path Units.
Jan 12 09:46:53 localhost systemd[1]: Reached target Socket Units.
Jan 12 09:46:53 localhost systemd[1]: Started Emergency Shell.
Jan 12 09:46:53 localhost systemd[1]: Reached target Emergency Mode.
Jan 12 09:46:53 localhost systemd[1]: Starting Tell Plymouth To Write Out Runtime Data...
Jan 12 09:46:53 localhost systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
Jan 12 09:46:53 localhost systemd[1]: Finished Restore /run/initramfs on shutdown.
Jan 12 09:46:53 localhost systemd[1]: Received SIGRTMIN+20 from PID 510 (plymouthd).
Jan 12 09:46:53 localhost systemd[1]: Finished Tell Plymouth To Write Out Runtime Data.
Jan 12 09:47:23 localhost udevadm[728]: Failed to wait for daemon to reply: Connection timed out
Jan 12 09:47:23 localhost systemd[1]: systemd-udev-settle.service: Main process exited, code=exited, status=1/FAILURE
Jan 12 09:47:23 localhost systemd[1]: systemd-udev-settle.service: Failed with result 'exit-code'.
Jan 12 09:47:23 localhost systemd[1]: Failed to start Wait for udev To Complete Device Initialization.
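The journal excerpt above can be condensed into the devices that timed out and the units that failed as a consequence. This is a minimal sketch that assumes the message wording shown in the log; the names `summarize_boot_failures` and `SAMPLE` are hypothetical.

```python
import re

# A few lines copied from the journal output in this ticket.
SAMPLE = """\
Jan 12 09:46:53 localhost systemd[1]: Timed out waiting for device /dev/disk/by-uuid/5d64b703-4230-4e81-8264-4bb436fd8546.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /root.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for Local File Systems.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for /.snapshots.
Jan 12 09:46:53 localhost systemd[1]: Timed out waiting for device /dev/disk/by-uuid/bbfe15a0-3906-4b51-a0bd-f1c885520b04.
Jan 12 09:46:53 localhost systemd[1]: Dependency failed for Swaps.
"""

TIMEOUT_RE = re.compile(r"Timed out waiting for device (\S+?)\.$")
DEPENDENCY_RE = re.compile(r"Dependency failed for (.+?)\.$")


def summarize_boot_failures(journal_text):
    """Return (timed-out devices, units that failed as dependencies)."""
    devices, dependents = [], []
    for line in journal_text.splitlines():
        line = line.rstrip()
        match = TIMEOUT_RE.search(line)
        if match:
            devices.append(match.group(1))
            continue
        match = DEPENDENCY_RE.search(line)
        if match:
            dependents.append(match.group(1))
    return devices, dependents


devices, dependents = summarize_boot_failures(SAMPLE)
print("timed out:", devices)
print("failed dependencies:", dependents)
```

In the full log only two devices time out, and all the failed mounts and the swap target follow from them. The timeouts are logged 90 seconds after the 09:45:23 entries, which would fit systemd's default 90 s job timeout, though the excerpt does not show when the device jobs started.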
Updated by mkittler almost 2 years ago
- Status changed from Workable to Feedback
- Assignee set to mkittler
It works again after rebooting. Rebooting another time worked as well. Not exactly sure what the problem was. Maybe /dev/sda was really just too slow (causing the timeout, which led to the system being stuck in the rescue shell). SMART doesn't show any problems with the disk.
I've also updated the system (to get the latest dependencies like perl-SemVer) and enabled s390x tests (as a temporary replacement for the setup on openqaworker1), and that seems to generally work.
Updated by mkittler almost 2 years ago
- Status changed from Feedback to Resolved
It is still up and running. I suppose the setup can be considered stable enough (and we can always reopen the ticket). Rollback steps regarding the s390x setup have been added to the openqaworker1 ticket.