action #138200
open
[qe-core] test fails in install_update - qam-minimal-RAID1@64bit-smp hit emergency shell
Description
Observation
The test ends up in the emergency shell for no apparent reason; I can't figure out from the logs what happened, although there are a lot of warnings from dracut in the serial log.
openQA test in scenario sle-15-SP5-Server-DVD-Incidents-Minimal-x86_64-qam-minimal-RAID1@64bit-smp fails in
install_update
Test suite description
Test suite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml. Installation of RAID1 using the expert partitioner.
minimal = base pattern; the minimal (enhanced base) pattern provides additional convenience packages.
Reproducible
Fails since (at least) Build :31100:openssl-3 (current job)
Expected result
Last good: :30673:systemd (or more recent)
Further details
Always latest result in this scenario: latest
Updated by szarate about 1 year ago
Marcus says nothing uses openssl in the boot process: https://suse.slack.com/archives/C02CCRM8946/p1697634197147119
Updated by szarate about 1 year ago
bsc reported: https://bugzilla.suse.com/show_bug.cgi?id=1216381
Maybe a meteor shower of cosmic rays hit the VM, flipped some bits, and made it work: https://openqa.suse.de/tests/12563612 passed. It could be related to the size of the disks or to some infra issue; perhaps a statistical investigation on OSD, comparing jobs with a 20 GB HDD against a 40 GB HDD, might help.
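The statistical comparison suggested above could be as simple as a pass rate per HDD size. Real data would come from the OSD API (e.g. via openqa-cli); the inline sample below is invented so the pipeline can be run anywhere, and the field extraction with sed/awk is just one possible approach (jq would be nicer where available):

```shell
# Sketch: pass rate per HDDSIZEGB, using a made-up inline sample in the
# same shape as the openQA jobs API output.
jobs=$(mktemp)
cat > "$jobs" <<'EOF'
{"jobs":[{"result":"passed","settings":{"HDDSIZEGB":"20"}},
{"result":"failed","settings":{"HDDSIZEGB":"20"}},
{"result":"passed","settings":{"HDDSIZEGB":"40"}}]}
EOF
# emit one "result size" pair per job object, then aggregate per size
summary=$(sed -n 's/.*"result":"\([a-z]*\)".*"HDDSIZEGB":"\([0-9]*\)".*/\1 \2/p' "$jobs" |
  awk '{n[$2]++; if ($1=="passed") p[$2]++}
       END {for (s in n) printf "%sGB: %d/%d passed\n", s, p[s]+0, n[s]}')
printf '%s\n' "$summary"
```

With the sample data this reports 1/2 passed for 20 GB and 1/1 passed for 40 GB; on real OSD data the same aggregation would show whether disk size correlates with the failure.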
Updated by szarate about 1 year ago
Updated by szarate about 1 year ago
zluo wrote in #note-5:
take over and check.
Zaoliang, do check what I mentioned above:
it could be related to the size of the disks or some infra stuff, perhaps some statistical investigation on OSD, might help? with HDD of 20 and HDD of 40GB
If you update your openQA client to the latest version, you can do something like `openqa-clone-job --skip-chained-deps --repeat=50 --within-instance $job BUILD=$POO_investigation _GROUP_ID="0"` (see https://github.com/os-autoinst/openQA/pull/5331#issuecomment-1772686986)
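Spelled out, the suggested re-run could look like the sketch below. The job URL and BUILD value are placeholders to substitute; the command is only echoed here, not executed, so the sketch stays side-effect free:

```shell
# Placeholders -- substitute the actual failing job and your ticket ID.
job="https://openqa.suse.de/tests/12563612"
build="poo138200_investigation"

# --repeat needs a recent openQA client (see the PR linked above);
# _GROUP_ID=0 keeps the clones out of the job group's statistics.
cmd="openqa-clone-job --skip-chained-deps --repeat=50 --within-instance $job BUILD=$build _GROUP_ID=0"
echo "$cmd"
```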
Updated by zluo about 1 year ago
- Status changed from New to In Progress
Trying to verify this issue on my openQA instance. I have a problem with openqa-worker after successful test runs (only a couple of them, most failed; the worker stopped but did not hit the emergency shell):
● openqa-worker-plain@1.service - openQA Worker #1
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-plain@.service; enabled; preset: disabled)
Active: active (running) since Tue 2023-10-24 20:46:54 CEST; 10h ago
Process: 3091 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/1 (code=exited, status=0/SUCCESS)
Main PID: 3098 (worker)
Tasks: 10 (limit: 4915)
CPU: 27min 54.521s
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-plain@1.service
├─3098 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
└─3911 /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:seria>
Okt 25 07:12:45 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 114.58 s
Okt 25 07:14:40 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 142.03 s
Okt 25 07:17:02 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 206.08 s
Okt 25 07:20:28 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 228.03 s
Okt 25 07:24:17 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 285.92 s
Okt 25 07:29:03 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 152.51 s
Okt 25 07:31:35 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 124.26 s
Okt 25 07:33:40 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 200.26 s
Okt 25 07:37:00 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 296.79 s
Okt 25 07:41:57 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 217.92 s
I think there is also an issue with openqa-worker, see http://10.168.192.143/tests/197: it just stopped or ended incomplete, and the worker cannot be released for the next job.
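Side note on the stuck worker: the journal warnings above already name the QEMU process that keeps the pool directory busy, so the PID can be pulled straight out of such a line. A small sketch (the warning text is a sample copied from the log above):

```shell
# Extract the blocking QEMU PID from a worker warning line.
warn="Okt 25 07:12:45 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 114.58 s"
pid=$(printf '%s\n' "$warn" | sed -n 's/.*(PID: \([0-9]*\)).*/\1/p')
echo "blocking PID: $pid"
# one could then inspect it with:   ps -p "$pid" -o pid,etime,cmd
# and, only if it is truly orphaned, stop it with:   kill "$pid"
```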
Updated by zluo about 1 year ago
Reason: api failure: 400 response: Can't write to file "/var/lib/openqa/testresults/00000/00000092-sle-15-SP5-Server-DVD-Incidents-Minimal-x86_64-Build:31100:openssl-3-qam-minimal-RAID1@64bit-smp/nOk6jXKPye": No space left on device at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/JobModules.pm line …
Scheduled product: job has not been created by posting an ISO
Assigned worker: quake1:1
I think this is clear now: no space left on device. 20 GB is too small.
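A quick guard against this failure mode would be to check free space under /var/lib/openqa before scheduling jobs. A sketch, where the 20 GiB threshold is an assumption and a hard-coded sample value stands in for the real df output so the snippet is self-contained:

```shell
# In real use the available space would come from df:
#   avail_kb=$(df --output=avail -k /var/lib/openqa | tail -n1)
avail_kb=1048576                      # sample value (1 GiB) for the sketch
min_kb=$((20 * 1024 * 1024))          # assumed threshold: roughly 20 GiB free
if [ "$avail_kb" -lt "$min_kb" ]; then
  msg="low space: ${avail_kb} KiB available"
else
  msg="space ok"
fi
echo "$msg"
```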
Updated by zluo about 1 year ago
Updated by zluo about 1 year ago
zluo wrote in #note-8:
Reason: api failure: 400 response: Can't write to file "/var/lib/openqa/testresults/00000/00000092-sle-15-SP5-Server-DVD-Incidents-Minimal-x86_64-Build:31100:openssl-3-qam-minimal-RAID1@64bit-smp/nOk6jXKPye": No space left on device at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/JobModules.pm line … Scheduled product: job has not been created by posting an ISO Assigned worker: quake1:1
I think this is clear now: no space left on device. 20 GB is too small.
I tried even with 40 GB, but I get the same or a similar issue. I don't think this is really related to disk space.
Updated by zluo about 1 year ago
http://10.168.192.143/tests/208#step/update_minimal/86
hit the emergency shell at a different place.
The only difference from a successful test run: there is a warning that /dev/disk/by-id/md-uuid-xx does not exist.
We had this issue before:
https://openqa.suse.de/tests/12466822#step/update_minimal/85
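When that warning appears, the UUID named in the missing symlink can be fed back to mdadm from the emergency shell. A sketch; the warning text and UUID below are invented, since the real one is elided ("md-uuid-xx") in the ticket:

```shell
# Pull the array UUID out of the dracut warning (sample text, made-up UUID).
warn="Warning: /dev/disk/by-id/md-uuid-ab12cd34:ef56ab78:90cd12ef:34ab56cd does not exist"
uuid=$(printf '%s\n' "$warn" | sed -n 's/.*md-uuid-\([0-9a-f:]*\).*/\1/p')
echo "missing array UUID: $uuid"
# from the emergency shell one could then try (not executed here):
#   mdadm --assemble --scan --uuid="$uuid"
#   cat /proc/mdstat
```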
Updated by szarate about 1 year ago
zluo wrote in #note-11:
http://10.168.192.143/tests/208#step/update_minimal/86
hit the emergency shell at a different place.
The only difference from a successful test run: there is a warning that /dev/disk/by-id/md-uuid-xx does not exist.
We had this issue before:
https://openqa.suse.de/tests/12466822#step/update_minimal/85
Thanks a lot, that's the issue we're looking for. Ask the devs what kind of logs we should collect, and let's catch that in openQA (let somebody else work on that in #135821).
Updated by zluo about 1 year ago
Now I can reproduce this issue: https://openqa.suse.de/tests/12736507#step/install_update/138
but I don't see that /run/initramfs/rdsosdebug.txt got saved anywhere.
@szarate can you check this please?
Is this what we need?
https://openqa.suse.de/tests/12725295#step/install_update/36
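One manual fallback for preserving /run/initramfs/rdsosdebug.txt is dumping it to the serial console from the emergency shell, so it lands in the job's serial0.txt even when nothing can be uploaded. A sketch; a temp file stands in for /dev/ttyS0 here so it is self-contained and side-effect free:

```shell
# Stand-ins: $src plays /run/initramfs/rdsosdebug.txt, $serial plays /dev/ttyS0.
src=$(mktemp); serial=$(mktemp)
echo "dracut: sample debug line" > "$src"
# in the VM this would be:   cat /run/initramfs/rdsosdebug.txt > /dev/ttyS0
[ -r "$src" ] && cat "$src" > "$serial"
captured=$(cat "$serial")
echo "$captured"
```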
Updated by zluo 11 months ago
- Status changed from In Progress to Feedback
Since the issue reported in https://bugzilla.suse.com/show_bug.cgi?id=1216381 is WIP, I think we can set this ticket to Feedback and check the bug fix later.
Updated by rfan1 4 months ago
- Related to action #165084: [qe-core] qam-minimal-RAID1 sometimes ends in emergency shell added