action #138200: [qe-core] test fails in install_update - qam-minimal-RAID1@64bit-smp hit emergency shell - openQA Tests (public) - openSUSE Project Management Tool

action #138200

closed

[qe-core] test fails in install_update - qam-minimal-RAID1@64bit-smp hit emergency shell

Added by szarate over 1 year ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee: zluo
Category: Bugs in existing tests
Target version: QA (public) - QE-Core: Ready
Start date: 2023-10-18
Due date:
% Done:
Estimated time:
Difficulty:
Tags: bugbusters

Description

Observation

The test ends up in the emergency shell for no apparent reason. I can't figure out from the logs what happened, although the serial log contains a lot of warnings from dracut.

openQA test in scenario sle-15-SP5-Server-DVD-Incidents-Minimal-x86_64-qam-minimal-RAID1@64bit-smp fails in install_update

Test suite description

Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml. Installation of RAID1 using the expert partitioner.
minimal = base pattern; the minimal (enhanced base) pattern consists of additional convenience packages

Reproducible

Fails since (at least) Build :31100:openssl-3 (current job)

Expected result

Last good: :30673:systemd (or more recent)

Further details

Always latest result in this scenario: latest

Related issues 1 (0 open — 1 closed)

Updated by szarate over 1 year ago

Description updated (diff)

Updated by szarate over 1 year ago

Marcus says nothing uses openssl in the boot process: https://suse.slack.com/archives/C02CCRM8946/p1697634197147119

Updated by szarate over 1 year ago

bsc reported: https://bugzilla.suse.com/show_bug.cgi?id=1216381

Maybe a meteor shower of cosmic rays hit the VM, flipped some bits and made it work, given that https://openqa.suse.de/tests/12563612 passed. It could be related to the size of the disks or to some infra issue. Perhaps a statistical investigation on OSD, with HDDs of 20 GB and 40 GB, might help?

Updated by szarate over 1 year ago

Created https://github.com/os-autoinst/os-autoinst/pull/2387

Updated by zluo over 1 year ago

Assignee set to zluo

take over and check.

Updated by szarate over 1 year ago

zluo wrote in #note-5:

take over and check.

Zaoliang, do check what I mentioned above:

it could be related to the size of the disks or to some infra issue. Perhaps a statistical investigation on OSD, with HDDs of 20 GB and 40 GB, might help?

If you update your openQA client to the latest version, you can do something like `openqa-clone-job --skip-chained-deps --repeat=50 --within-instance $job BUILD=$POO_investigation _GROUPID="0"` (see https://github.com/os-autoinst/openQA/pull/5331#issuecomment-1772686986)
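
Once the repeated clones have finished, the statistical comparison could look roughly like the sketch below. It assumes the pass/fail counts per disk size are collected by hand from the cloned jobs; the function name `wilson_interval` and the counts in the usage part are hypothetical and not taken from any real run. It computes a Wilson score interval for the failure rate, so the 20 GB and 40 GB setups can be compared with some idea of the uncertainty.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Return the Wilson score interval (low, high) for the rate failures/runs.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= failures <= runs:
        raise ValueError("failures must be between 0 and runs")
    p = failures / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Hypothetical counts, only to show the call: (failures, runs) per HDD size.
    observed = {"20 GB": (12, 50), "40 GB": (3, 50)}
    for size, (failures, runs) in observed.items():
        low, high = wilson_interval(failures, runs)
        print(f"HDD {size}: {failures}/{runs} failed, 95% interval {low:.3f} to {high:.3f}")
```

If the two intervals overlap heavily, 50 runs per disk size are probably not enough to blame the disk size.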

Updated by zluo over 1 year ago

Status changed from New to In Progress

I am trying to verify this issue on my own openQA instance. I have a problem with openqa-worker after successful test runs (only a couple succeeded; most failed because the worker stopped, but they didn't hit the emergency shell):

● openqa-worker-plain@1.service - openQA Worker #1
     Loaded: loaded (/usr/lib/systemd/system/openqa-worker-plain@.service; enabled; preset: disabled)
     Active: active (running) since Tue 2023-10-24 20:46:54 CEST; 10h ago
    Process: 3091 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/1 (code=exited, status=0/SUCCESS)
   Main PID: 3098 (worker)
      Tasks: 10 (limit: 4915)
        CPU: 27min 54.521s
     CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-plain@1.service
             ├─3098 /usr/bin/perl /usr/share/openqa/script/worker --instance 1
             └─3911 /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:seria>

Okt 25 07:12:45 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 114.58 s
Okt 25 07:14:40 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 142.03 s
Okt 25 07:17:02 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 206.08 s
Okt 25 07:20:28 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 228.03 s
Okt 25 07:24:17 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 285.92 s
Okt 25 07:29:03 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 152.51 s
Okt 25 07:31:35 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 124.26 s
Okt 25 07:33:40 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 200.26 s
Okt 25 07:37:00 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 296.79 s
Okt 25 07:41:57 quake1 worker[3098]: [warn] A QEMU instance using the current pool directory is still running (PID: 3911) - checking again for web UI 'localhost' in 217.92 s
~

I think there is also an issue with openqa-worker, see http://10.168.192.143/tests/197: the job just stopped or ended up incomplete, and the worker cannot be released for the next job.
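
To see how long a worker has been blocked by a leftover QEMU process, the warnings above can be summarized with a small script. This is only a sketch: the regular expression is derived from the log lines quoted in this comment, and the helper name `summarize_stuck_qemu` is made up for illustration.

```python
import re

# Pattern derived from the worker warnings quoted above; other openQA
# versions may word the message differently.
QEMU_STILL_RUNNING = re.compile(
    r"A QEMU instance using the current pool directory is still running "
    r"\(PID: (?P<pid>\d+)\) - checking again for web UI '(?P<webui>[^']+)' "
    r"in (?P<delay>\d+(?:\.\d+)?) s"
)


def summarize_stuck_qemu(lines):
    """Return {pid: [retry delay in seconds, ...]} for every matching warning."""
    result = {}
    for line in lines:
        match = QEMU_STILL_RUNNING.search(line)
        if match:
            pid = int(match.group("pid"))
            result.setdefault(pid, []).append(float(match.group("delay")))
    return result


if __name__ == "__main__":
    import sys

    # Usage idea: journalctl -u openqa-worker-plain@1 | python3 this_script.py
    for pid, delays in summarize_stuck_qemu(sys.stdin).items():
        print(f"QEMU PID {pid}: {len(delays)} warnings, {sum(delays):.0f} s of announced waiting")
```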

Updated by zluo over 1 year ago

Reason: api failure: 400 response: Can't write to file "/var/lib/openqa/testresults/00000/00000092-sle-15-SP5-Server-DVD-Incidents-Minimal-x86_64-Build:31100:openssl-3-qam-minimal-RAID1@64bit-smp/nOk6jXKPye": No space left on device at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/JobModules.pm line …
Scheduled product: job has not been created by posting an ISO
Assigned worker: quake1:1

I think this is clear now: no space left on device. 20GB is too small.
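
A quick way to confirm this before starting more runs is to check the free space on the filesystem that holds the test results. The sketch below uses only the Python standard library; the helper names and the 5 GiB threshold in the usage comment are assumptions for illustration, while the path comes from the error message above.

```python
import shutil


def free_gib(path: str) -> float:
    """Return the free space, in GiB, of the filesystem containing path."""
    return shutil.disk_usage(path).free / 2**30


def has_enough_space(path: str, required_gib: float) -> bool:
    """Return True if at least required_gib GiB are free under path."""
    return free_gib(path) >= required_gib


# Usage idea (the threshold is an arbitrary example value):
#   if not has_enough_space("/var/lib/openqa/testresults", 5.0):
#       print("testresults filesystem is nearly full, clean up before cloning jobs")
```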

Updated by zluo over 1 year ago

opened https://progress.opensuse.org/issues/138464

#10

Updated by zluo over 1 year ago

zluo wrote in #note-8:

Reason: api failure: 400 response: Can't write to file "/var/lib/openqa/testresults/00000/00000092-sle-15-SP5-Server-DVD-Incidents-Minimal-x86_64-Build:31100:openssl-3-qam-minimal-RAID1@64bit-smp/nOk6jXKPye": No space left on device at /usr/share/openqa/script/../lib/OpenQA/Schema/Result/JobModules.pm line …
Scheduled product: job has not been created by posting an ISO
Assigned worker: quake1:1

I think this is clear now: no space left on device. 20GB is too small.

I even tried with 40GB, but I hit the same or a similar issue, so I don't think this is really related to disk space.

#11

Updated by zluo over 1 year ago

http://10.168.192.143/tests/208#step/update_minimal/86
This hit the emergency shell at a different place.
The only difference from a successful test run is a warning that /dev/disk/by-id/md-uuid-xx does not exist.
We had this issue before:
https://openqa.suse.de/tests/12466822#step/update_minimal/85
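
To spot this warning automatically across several runs, the serial log could be scanned for missing md devices. This is a rough sketch: the exact wording of the warning is assumed from the description above, and the function name `missing_md_devices` is hypothetical.

```python
import re

# Assumed wording: "<device path> does not exist", with the device living
# under /dev/disk/by-id/ and named md-uuid-<something>.
MISSING_MD = re.compile(r"(/dev/disk/by-id/md-uuid-\S+) does not exist")


def missing_md_devices(serial_log_text: str):
    """Return the sorted, de-duplicated md-uuid device paths reported as missing."""
    return sorted(set(MISSING_MD.findall(serial_log_text)))


if __name__ == "__main__":
    import sys

    # Usage idea: python3 this_script.py serial0.txt
    with open(sys.argv[1], errors="replace") as handle:
        for device in missing_md_devices(handle.read()):
            print("missing:", device)
```

An empty result for a failed job would suggest that the emergency shell had a different cause.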

#12

Updated by szarate over 1 year ago

zluo wrote in #note-11:

http://10.168.192.143/tests/208#step/update_minimal/86
This hit the emergency shell at a different place.
The only difference from a successful test run is a warning that /dev/disk/by-id/md-uuid-xx does not exist.
We had this issue before:
https://openqa.suse.de/tests/12466822#step/update_minimal/85

Thanks a lot, that's the issue we're looking for. Ask the developer what kind of logs we should collect, and let's catch that in openQA (let somebody else work on that in #135821).

#13

Updated by zluo over 1 year ago

Now I can reproduce this issue: https://openqa.suse.de/tests/12736507#step/install_update/138
but I don't see /run/initramfs/rdsosdebug.txt saved anywhere.
@szarate can you check this please?

Is this what we need?
https://openqa.suse.de/tests/12725295#step/install_update/36

#14

Updated by zluo over 1 year ago

Status changed from In Progress to Feedback

Since the issue reported in https://bugzilla.suse.com/show_bug.cgi?id=1216381 is WIP, I think we can set this ticket to Feedback and check the bugfix later.

#15

Updated by rfan1 10 months ago

Related to action #165084: [qe-core] qam-minimal-RAID1 sometimes ends in emergency shell added

#16

Updated by mgrifalconi 5 months ago

Status changed from Feedback to Resolved
