action #138200

open

[qe-core] test fails in install_update - qam-minimal-RAID1@64bit-smp hit emergency shell

Added by szarate 6 months ago. Updated 4 months ago.

Status:
Feedback
Priority:
Normal
Assignee:
Category:
Bugs in existing tests
Target version:
Start date:
2023-10-18
Due date:
% Done:

0%

Estimated time:
Difficulty:

Description

Observation

The test ends up in the emergency shell for no apparent reason. I can't figure out from the logs what happened, although there are a lot of dracut warnings in the serial log.

openQA test in scenario sle-15-SP5-Server-DVD-Incidents-Minimal-x86_64-qam-minimal-RAID1@64bit-smp fails in
install_update

Test suite description

Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml. Installation of RAID1 using expert partitioner
minimal = base pattern, minimal (enhanced base) pattern are additional convenience packages

Reproducible

Fails since (at least) Build :31100:openssl-3 (current job)

Expected result

Last good: :30673:systemd (or more recent)

Further details

Always latest result in this scenario: latest

Actions #1

Updated by szarate 6 months ago

  • Description updated (diff)
Actions #2

Updated by szarate 6 months ago

Marcus says nothing uses openssl in the boot process: https://suse.slack.com/archives/C02CCRM8946/p1697634197147119

Actions #3

Updated by szarate 6 months ago

bsc reported: https://bugzilla.suse.com/show_bug.cgi?id=1216381

Maybe a meteor shower with cosmic rays hit the VM, flipped some bits and made it work, since https://openqa.suse.de/tests/12563612 passed. It could be related to the size of the disks or some infra stuff. Perhaps some statistical investigation on OSD, with HDDs of 20GB and 40GB, might help?

Actions #5

Updated by zluo 6 months ago

  • Assignee set to zluo

take over and check.

Actions #6

Updated by szarate 6 months ago

zluo wrote in #note-5:

take over and check.

Zaoliang, do check what I mentioned above:

it could be related to the size of the disks or some infra stuff. Perhaps some statistical investigation on OSD, with HDDs of 20GB and 40GB, might help?

If you update your openQA client to the latest version, you can do something like `openqa-clone-job --skip-chained-deps --repeat=50 --within-instance $job BUILD=$POO_investigation _GROUPID="0"` (see https://github.com/os-autoinst/openQA/pull/5331#issuecomment-1772686986)
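Once the repeated jobs have finished, the per-disk-size comparison suggested above could be tallied with something like the following. This is only a minimal sketch: the function name `failure_rates` and the input shape (pairs of HDD size and openQA job result) are assumptions made for illustration, and the sample data is a placeholder, not real results from OSD.

```python
from collections import Counter

# openQA result values that are counted as "not a failure" in this sketch.
OK_RESULTS = {"passed", "softfailed"}


def failure_rates(runs):
    """Summarize failures per HDD size.

    `runs` is an iterable of (hdd_size_gb, result) pairs.
    Returns {hdd_size_gb: (failed, total, failure_rate)}.
    """
    totals = Counter()
    failures = Counter()
    for size_gb, result in runs:
        totals[size_gb] += 1
        if result not in OK_RESULTS:
            failures[size_gb] += 1
    return {
        size: (failures[size], totals[size], failures[size] / totals[size])
        for size in sorted(totals)
    }


if __name__ == "__main__":
    # Placeholder data for illustration only, NOT real job results.
    sample = [(20, "failed"), (20, "passed"), (40, "passed"), (40, "softfailed")]
    for size, (failed, total, rate) in failure_rates(sample).items():
        print(f"HDD {size}GB: {failed}/{total} failed ({rate:.0%})")
```

With 50 runs per disk size, a clearly different failure rate between 20GB and 40GB would point at the disk size; similar rates would point elsewhere.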

Actions #7

Updated by zluo 6 months ago

  • Status changed from New to In Progress

I'm trying to verify this issue on my own openQA instance. I have a problem with openqa-worker after successful test runs (only a couple succeeded; most failed because the worker stopped, but none hit the emergency shell):

● openqa-worker-plain@1.service - openQA Worker #1
     Loaded: loaded (/usr/lib/systemd/system/openqa-worker-plain@.service; enabled; preset: disabled)
     Active: active (running) since Tue 2023-10-24 20:46:54 CEST; 10h ago
    Process: 3091 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/1 (code=exited, status=0/SUCCESS)
   Main PID: 3098 (worker)
      Tasks: 10 (limit: 4915)
        CPU: 27min 54.521s
     CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-plain@1.service
             ├─3098 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
             └─3911 /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:seria>

Okt 25 07:12:45 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 114.58 s
Okt 25 07:14:40 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 142.03 s
Okt 25 07:17:02 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 206.08 s
Okt 25 07:20:28 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 228.03 s
Okt 25 07:24:17 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 285.92 s
Okt 25 07:29:03 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 152.51 s
Okt 25 07:31:35 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 124.26 s
Okt 25 07:33:40 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 200.26 s
Okt 25 07:37:00 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 296.79 s
Okt 25 07:41:57 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 217.92 s

I think there is also an issue with openqa-worker. See http://10.168.192.143/tests/197: the job just stopped or ended up incomplete, and the worker cannot be released for the next job.

Actions #8

Updated by zluo 6 months ago

Reason: api failure: 400 response: Can't write to file "/var/lib/openqa/testresults/00000/00000092-sle-15-SP5-Server-DVD-Incidents-Minimal-x86_64-Build:31100:openssl-3-qam-minimal-RAID1@64bit-smp/nOk6jXKPye": No space left on device at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/JobModules.pm line …
Scheduled product: job has not been created by posting an ISO
Assigned worker: quake1:1

I think this is clear now: no space left on device. 20GB is too small.
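The error message above names a file under /var/lib/openqa/testresults, so one quick check is how much free space the filesystem holding that directory has on the openQA host. A minimal sketch using only the Python standard library; the helper names and the 5 GiB threshold are arbitrary assumptions for illustration:

```python
import shutil


def free_gib(path):
    """Return the free space, in GiB, of the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_enough_space(path, min_free_gib):
    """Return True if at least `min_free_gib` GiB are free at `path`."""
    return free_gib(path) >= min_free_gib


if __name__ == "__main__":
    # Path taken from the error message; the threshold is a made-up example value.
    results_dir = "/var/lib/openqa/testresults"
    try:
        print(f"{results_dir}: {free_gib(results_dir):.1f} GiB free")
        print("enough space:", has_enough_space(results_dir, 5))
    except FileNotFoundError:
        print(f"{results_dir} does not exist on this host")
```

Running this before and after a batch of jobs would show whether the results directory is what fills up.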

Actions #10

Updated by zluo 6 months ago

zluo wrote in #note-8:

Reason: api failure: 400 response: Can't write to file "/var/lib/openqa/testresults/00000/00000092-sle-15-SP5-Server-DVD-Incidents-Minimal-x86_64-Build:31100:openssl-3-qam-minimal-RAID1@64bit-smp/nOk6jXKPye": No space left on device at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/JobModules.pm line …
Scheduled product: job has not been created by posting an ISO
Assigned worker: quake1:1

I think this is clear now: no space left on device. 20GB is too small.

I even tried with 40GB, but I get the same or a similar issue. I don't think this is really related to disk space.

Actions #11

Updated by zluo 6 months ago

http://10.168.192.143/tests/208#step/update_minimal/86
It hit the emergency shell at a different place.
The only difference from a successful test run is a warning that /dev/disk/by-id/md-uuid-xx does not exist.
We had this issue before:
https://openqa.suse.de/tests/12466822#step/update_minimal/85
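To compare failing and passing runs more systematically, the serial log could be scanned for that warning. The sketch below assumes the warning text contains the device path followed by "does not exist" and that the path is of the form /dev/disk/by-id/md-uuid-...; the exact log line format and the function name are assumptions, not taken from the job logs.

```python
import re

# Assumed pattern: an md-uuid device path followed by "does not exist".
MISSING_MD_RE = re.compile(r"(/dev/disk/by-id/md-uuid-\S+) does not exist")


def missing_md_devices(log_text):
    """Return the distinct md-uuid device paths reported as missing, in order of first appearance."""
    seen = []
    for match in MISSING_MD_RE.finditer(log_text):
        device = match.group(1)
        if device not in seen:
            seen.append(device)
    return seen


if __name__ == "__main__":
    # Hypothetical log excerpt for illustration; not copied from a real job.
    sample = (
        "dracut-initqueue[321]: Warning: /dev/disk/by-id/md-uuid-aaaa:bbbb does not exist\n"
        "dracut-initqueue[321]: Warning: /dev/disk/by-id/md-uuid-aaaa:bbbb does not exist\n"
    )
    print(missing_md_devices(sample))
```

An empty list for passing runs and a non-empty list for failing runs would support the observation that this warning is the distinguishing symptom.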

Actions #12

Updated by szarate 6 months ago

zluo wrote in #note-11:

http://10.168.192.143/tests/208#step/update_minimal/86
It hit the emergency shell at a different place.
The only difference from a successful test run is a warning that /dev/disk/by-id/md-uuid-xx does not exist.
We had this issue before:
https://openqa.suse.de/tests/12466822#step/update_minimal/85

Thanks a lot, that's the issue we're looking for. Ask the dev what kind of logs we should collect, and let's catch that in openQA (let somebody else work on that in #135821).

Actions #13

Updated by zluo 6 months ago

Now I can reproduce this issue: https://openqa.suse.de/tests/12736507#step/install_update/138
but I don't see /run/initramfs/rdsosreport.txt being saved anywhere.
@szarate can you check this please?

Is this what we need?
https://openqa.suse.de/tests/12725295#step/install_update/36

Actions #14

Updated by zluo 4 months ago

  • Status changed from In Progress to Feedback

Since the issue reported in https://bugzilla.suse.com/show_bug.cgi?id=1216381 is WIP, I think we can set this ticket to Feedback and check the bugfix later.
