openSUSE Project Management Tool: Issues https://progress.opensuse.org/ 2023-10-30T12:45:10Z openSUSE Project Management Tool Redmine
openQA Infrastructure - action #138746 (Resolved): [tools] s390x VM randomly fails to open QCOW d... https://progress.opensuse.org/issues/138746 2023-10-30T12:45:10Z MDoucha martin.doucha@suse.com
<p>s390x tests randomly fail to boot because the VM does not have permission to open the disk image. Multiple workers have the same issue. Restarting the job usually fixes the issue. Examples:</p>
<p><a href="https://openqa.suse.de/tests/12711015#step/bootloader_zkvm/31" class="external">https://openqa.suse.de/tests/12711015#step/bootloader_zkvm/31</a><br>
<a href="https://openqa.suse.de/tests/12711015/logfile?filename=autoinst-log.txt" class="external">https://openqa.suse.de/tests/12711015/logfile?filename=autoinst-log.txt</a></p>
<p><a href="https://openqa.suse.de/tests/12716015#step/bootloader_zkvm/31" class="external">https://openqa.suse.de/tests/12716015#step/bootloader_zkvm/31</a><br>
<a href="https://openqa.suse.de/tests/12716015/logfile?filename=autoinst-log.txt" class="external">https://openqa.suse.de/tests/12716015/logfile?filename=autoinst-log.txt</a></p>
<p><a href="https://openqa.suse.de/tests/12708886#step/bootloader_start/34" class="external">https://openqa.suse.de/tests/12708886#step/bootloader_start/34</a><br>
<a href="https://openqa.suse.de/tests/12708886/logfile?filename=autoinst-log.txt" class="external">https://openqa.suse.de/tests/12708886/logfile?filename=autoinst-log.txt</a></p>
<pre><code>[2023-10-28T00:17:57.550325+02:00] [debug] [pid:56810] [run_ssh_cmd(virsh start openQA-SUT-6 2> >(tee /tmp/os-autoinst-openQA-SUT-6-stderr.log >&2))] stderr:
error: Failed to start domain 'openQA-SUT-6'
error: internal error: process exited while connecting to monitor: 2023-10-27T22:17:57.331249Z qemu-system-s390x: -blockdev {"driver":"file","filename":"/var/lib/libvirt/images//SLES-15-SP4-s390x-mru-install-minimal-with-addons-Build20231027-1-Server-DVD-Updates-s390x-kvm.qcow2","node-name":"libvirt-3-storage","cache":{"direct":false,"no-flush":true},"auto-read-only":true,"discard":"unmap"}: Could not open '/var/lib/libvirt/images//SLES-15-SP4-s390x-mru-install-minimal-with-addons-Build20231027-1-Server-DVD-Updates-s390x-kvm.qcow2': Permission denied
</code></pre>
openQA Project - action #124493 (Resolved): openqa-clone-job --skip-deps behavior contradicts doc... https://progress.opensuse.org/issues/124493 2023-02-14T14:49:46Z MDoucha martin.doucha@suse.com
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Both <a href="http://open.qa/docs/#_handling_of_dependencies_when_cloning_jobs" class="external">OpenQA documentation</a> and <code>openqa-clone-job --help</code> say that <code>--skip-deps</code> and <code>--skip-chained-deps</code> should only prevent cloning of <strong>parent</strong> jobs. In reality, however, both options will prevent cloning of all (chained) dependencies regardless of parent/child relationship (even when you specify <code>--clone-children</code>). This means there is currently no way to clone a dependency subtree without parents using <code>openqa-clone-job</code>. The subtree can only be restarted in webUI which does not support modifying settings of the restarted jobs.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> There is a way to clone a dependency subtree without parents using <code>openqa-clone-job</code> (in accordance with the documentation).</li>
</ul>
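<p>A minimal sketch of the selection that AC1 asks for, assuming a hypothetical in-memory map from a parent job ID to its chained child job IDs. It is not the <code>openqa-clone-job</code> implementation, and the job IDs are made up for illustration:</p>

```python
from collections import deque


def subtree_without_parents(children_of, root):
    """Return root plus all of its transitive children, ignoring any parents."""
    selected = []
    seen = {root}
    queue = deque([root])
    while queue:
        job = queue.popleft()
        selected.append(job)
        # Only follow parent -> child edges; parents of root are never visited.
        for child in children_of.get(job, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return selected


# Hypothetical chained dependencies: 1 -> 2 -> {3, 4}
CHILDREN_OF = {1: [2], 2: [3, 4]}
print(subtree_without_parents(CHILDREN_OF, 2))  # prints [2, 3, 4]
```

<p>Job 1 is the parent of job 2 and is left out, which is the behavior the documentation describes for skipping parents while still cloning children.</p>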
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>It probably worked in the past, maybe a regression?</li>
<li>Create a set of dependent jobs locally (e.g. by setting dependencies manually within the database or by cloning a set of jobs from production) and run <code>openqa-clone-job</code> locally with the parameters mentioned in the description</li>
<li>Extend unit tests</li>
</ul>
openQA Tests - action #116287 (Rejected): [qe-core][s390x] SSH serial terminal connection issues ... https://progress.opensuse.org/issues/116287 2022-09-06T13:54:08Z MDoucha martin.doucha@suse.com
<p>s390x livepatch tests had a lot of installation failures this month due to SSH serial terminal connection failures. Interestingly enough, the connection failures seem to happen around the same module step. serial_terminal.txt output appears to be out of sync with the terminal because part of the commands and output is missing even though it's listed in the update_kernel module details. The dmesg output in serial0.txt often (but not always) shows some key exchange SSH error followed by output from a completely different job:</p>
<pre><code>Welcome to SUSE Linux Enterprise Server 15 SP2 (s390x) - Kernel 5.3.18-24.83-default (ttysclp0).
eth0: 10.161.145.86 fe80::5054:ff:fe84:f877
susetest login: root
Password:
Last login: Mon Sep 5 10:18:10 from 10.160.0.147
susetest:~ # systemctl is-active network
active
susetest:~ # systemctl is-active sshd
active
susetest:~ # 2022-09-05T10:25:03.604370-04:00 susetest sshd[4272]: error: kex_exchange_identification: Connection closed by remote host
2022-09-05T10:25:04.844743-04:00 susetest sshd[4273]: error: kex_exchange_identification: Connection closed by remote host
[ 107.444474] LTP: starting DI000 (dirty)
[ 107.445525] LTP: starting DS000 (dio_sparse)
[ 107.466125] LTP: starting abort01
[ 107.758318] LTP: starting accept01
</code></pre>
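<p>To compare many jobs, the key exchange errors can be pulled out of a downloaded serial0.txt. The following is a rough sketch only; the regular expression is an assumption derived from the two sshd lines in the excerpt above, and the helper name is made up:</p>

```python
import re

# Assumed pattern, modelled on the sshd lines in the serial0.txt excerpt.
KEX_ERROR = re.compile(r"sshd\[(\d+)\]: error: kex_exchange_identification: (.+)$")


def find_kex_errors(serial_log):
    """Return (line_number, sshd_pid, message) for each SSH key exchange error."""
    hits = []
    for number, line in enumerate(serial_log.splitlines(), start=1):
        match = KEX_ERROR.search(line)
        if match:
            hits.append((number, int(match.group(1)), match.group(2)))
    return hits


SAMPLE = """susetest:~ # systemctl is-active sshd
active
susetest:~ # 2022-09-05T10:25:03.604370-04:00 susetest sshd[4272]: error: kex_exchange_identification: Connection closed by remote host
[  107.444474] LTP: starting DI000 (dirty)
"""
for hit in find_kex_errors(SAMPLE):
    print(hit)
```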
<p>12-SP4: <a href="https://openqa.suse.de/tests/9438804#step/update_kernel/337" class="external">https://openqa.suse.de/tests/9438804#step/update_kernel/337</a><br>
15-SP2: <a href="https://openqa.suse.de/tests/9457752#step/update_kernel/337" class="external">https://openqa.suse.de/tests/9457752#step/update_kernel/337</a><br>
15-SP3: <a href="https://openqa.suse.de/tests/9458645#step/update_kernel/337" class="external">https://openqa.suse.de/tests/9458645#step/update_kernel/337</a><br>
15-SP4: <a href="https://openqa.suse.de/tests/9455666#step/update_kernel/199" class="external">https://openqa.suse.de/tests/9455666#step/update_kernel/199</a></p>
<p>I could not find any such connection failure on SLE-12SP5. Other SLE releases don't support s390x livepatches and KOTD tests don't show this kind of issue. This looks like a kernel bug but I'd like an s390x expert to look at this before I create a Bugzilla ticket. And of course this has exposed logging issues in OpenQA.</p>
openQA Infrastructure - action #115925 (New): aarch64: Random QEMU failures while retrieving host... https://progress.opensuse.org/issues/115925 2022-08-29T08:44:02Z MDoucha martin.doucha@suse.com
<p>Since the worker upgrade to Leap 15.4, some aarch64 jobs have randomly failed with the following error: <code>qemu-system-aarch64: Failed to retrieve host CPU features</code><br>
Example: <a href="https://openqa.suse.de/tests/9401654" class="external">https://openqa.suse.de/tests/9401654</a></p>
openQA Infrastructure - action #108266 (New): grenache: script_run() commands randomly time out s... https://progress.opensuse.org/issues/108266 2022-03-14T09:36:30Z MDoucha martin.doucha@suse.com
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>Since the NBG server room was moved, I'm seeing a lot of random script_run() command timeouts on grenache. I suspect network issues.<br>
<a href="https://openqa.suse.de/tests/8320677#step/sighold02/12">https://openqa.suse.de/tests/8320677#step/sighold02/12</a><br>
<a href="https://openqa.suse.de/tests/8294410#step/fallocate06/8">https://openqa.suse.de/tests/8294410#step/fallocate06/8</a><br>
<a href="https://openqa.suse.de/tests/8294334#step/boot_ltp/42">https://openqa.suse.de/tests/8294334#step/boot_ltp/42</a></p>
<pre><code> Test died: command 'vmstat -w' timed out at /usr/lib/os-autoinst/testapi.pm line 1039.
# Test died: Timed out waiting for LTP test case which may still be running or the OS may have crashed! at sle/tests/kernel/run_ltp.pm line 337.
# Test died: command 'rpm -qi kernel-default > /tmp/kernel-pkg.txt 2>&1' timed out at /usr/lib/os-autoinst/testapi.pm line 1039.
main::init_backend() called at /usr/bin/isotovideo line 258
[2022-03-09T16:12:24.052826+01:00] [info] ::: consoles::serial_screen::read_until: Matched output from SUT in 1 loops & 0.00229895696975291 seconds: Use of uninitialized value $regexp in concatenation (.) or string at /usr/lib/os-autoinst/testapi.pm line 927.
testapi::wait_serial(undef, undef, 0, "no_regex", 1) called at sle/tests/kernel/run_ltp.pm line 317
run_ltp::run(run_ltp=HASH(0x1001999aee8), LTP::TestInfo=HASH(0x1001b24d630)) called at /usr/lib/os-autoinst/basetest.pm line 356
cf. last good
[2022-03-12T07:06:13.797172+01:00] [info] ::: consoles::serial_screen::read_until: Matched output from SUT in 1 loops & 0.00224426796194166 seconds:
Use of uninitialized value $regexp in concatenation (.) or string at /usr/lib/os-autoinst/testapi.pm line 927.
testapi::wait_serial(undef, undef, 0, "no_regex", 1) called at sle/tests/kernel/run_ltp.pm line 317
run_ltp::run(run_ltp=HASH(0x1003570fb08), LTP::TestInfo=HASH(0x1003547afa8)) called at /usr/lib/os-autoinst/basetest.pm line 356
eval {...} called at /usr/lib/os-autoinst/basetest.pm line 354
</code></pre>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> Users no longer file complaints about script_run timing out</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Find a reproducer or database query to identify recent cases, e.g. ask Martin. EDIT: mdoucha responded that there is no special query available. Next suggestion: pick any recent job where the problem happened and trigger 1k jobs for investigation, e.g. with an appropriate priority or over the weekend.</li>
<li>Look into warnings in logs</li>
<li>"Use of uninitialized value $regexp in concatenation (.) or string" is already fixed</li>
<li>last good: <a href="https://openqa.suse.de/tests/8315985">https://openqa.suse.de/tests/8315985</a></li>
<li>[debug] Current version is 4.6.1647014989.7540333c [interface v25]
<ul>
<li>Do <code>git log --no-merges 7540333c..$first_bad</code></li>
</ul></li>
<li><p>Investigate the timeout handling, cf. recent improvements to the VNC connection code and the handling of formerly blocking code paths</p>
<ul>
<li>We don't have a screenshot to compare the serial output to</li>
<li>Maybe we can check the serial logs for comparison?</li>
</ul></li>
</ul>
<p>All these occurrences are on the same machine, which is s390x-kvm-sle12.</p>
<p>One problem I see is that in <a href="https://openqa.suse.de/tests/8505116#step/shutdown_ltp/6">https://openqa.suse.de/tests/8505116#step/shutdown_ltp/6</a> we have a serial terminal. If there were VNC, we would be able to see whether the command was executed or not. I also don't see the commands in <a href="https://openqa.suse.de/tests/8505116/logfile?filename=serial_terminal.txt">https://openqa.suse.de/tests/8505116/logfile?filename=serial_terminal.txt</a> or in serial0.txt.</p>
<p>We should try to resolve the ambiguity: either commands simply never write to the serial terminal as they time out, or data is actually going missing on the way from the SUT to the worker.</p>
<p>What is the best way to reproduce the issue? If we have a reproducer, we can try to make it as small as possible and then fix it, maybe by just increasing the timeout. Maybe we should also ensure that we catch any console-related processes in the background and check whether they are still responsive.</p>
<a name="Further-suggestions-from-SUSE-QE-Tools-unblock-2022-05-11"></a>
<h3 >Further suggestions from SUSE QE Tools unblock 2022-05-11<a href="#Further-suggestions-from-SUSE-QE-Tools-unblock-2022-05-11" class="wiki-anchor">¶</a></h3>
<ul>
<li>As suggested in <a class="issue tracker-4 status-1 priority-4 priority-default child" title="action: grenache: script_run() commands randomly time out since server room move (New)" href="https://progress.opensuse.org/issues/108266#note-22">#108266#note-22</a>, there should be monitoring of critical components, similar to what we do for openQA worker hosts (out of scope for SUSE QE Tools, delegate to SUSE QE Core)</li>
<li>within the code called by script_run using ssh
<ul>
<li>retry</li>
<li>check if the ssh connection is still there at all</li>
<li>provide more details when failing</li>
</ul></li>
<li>Include in the timeout message how long we waited</li>
</ul>
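<p>The retry and "how long we waited" suggestions can be sketched as follows. This is an illustration only, not os-autoinst code: it runs a local command with Python's <code>subprocess</code> as a stand-in for the SSH call, the helper name <code>run_with_retry</code> is made up, and it does not cover checking whether the SSH connection is still alive:</p>

```python
import subprocess
import sys
import time


def run_with_retry(cmd, timeout, retries=2):
    """Run cmd, retrying on timeout and recording how long each attempt waited."""
    failures = []
    for attempt in range(1, retries + 2):
        started = time.monotonic()
        try:
            return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            waited = time.monotonic() - started
            failures.append(f"attempt {attempt}: timed out after {waited:.1f}s")
    # Give the reader of the log every attempt and its measured wait time.
    raise TimeoutError(f"command {cmd!r} failed; " + "; ".join(failures))


# Stand-in for a remote command: a short local Python one-liner.
result = run_with_retry([sys.executable, "-c", "print('ok')"], timeout=10)
print(result.stdout.strip())
```

<p>A message such as "attempt 2: timed out after 90.0s" would at least separate a slow SUT from a connection that never answered at all.</p>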
openQA Infrastructure - action #107989 (Resolved): CPU-specific worker classes https://progress.opensuse.org/issues/107989 2022-03-08T11:28:38Z MDoucha martin.doucha@suse.com
<p>We have a few tests which require specific CPU types, for example the CPU vulnerability mitigation tests. It'd be useful to have worker classes like <code>x86_64_amd</code> and <code>x86_64_intel</code> so that we can schedule tests on workers which have the required features or vulnerabilities.</p>
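<p>One possible way to derive such a class on a worker is sketched below. It assumes the class would be chosen from the <code>vendor_id</code> field of <code>/proc/cpuinfo</code>; the helper name and the sample text are made up, and this is not how openQA assigns worker classes today:</p>

```python
# CPUID vendor strings as they appear in /proc/cpuinfo on x86_64.
VENDOR_TO_CLASS = {
    "AuthenticAMD": "x86_64_amd",
    "GenuineIntel": "x86_64_intel",
}


def cpu_worker_class(cpuinfo_text):
    """Map the first vendor_id line of /proc/cpuinfo text to a worker class."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "vendor_id":
            return VENDOR_TO_CLASS.get(value.strip())
    return None  # no vendor_id line, e.g. on a non-x86 host


SAMPLE = "processor\t: 0\nvendor_id\t: GenuineIntel\nmodel name\t: Example CPU\n"
print(cpu_worker_class(SAMPLE))  # prints x86_64_intel
```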
openQA Infrastructure - action #105867 (Resolved): OpenQA bot schedules jobs with incomplete INCI... https://progress.opensuse.org/issues/105867 2022-02-03T10:23:54Z MDoucha martin.doucha@suse.com
<p>This week, the OpenQA bot has been scheduling kernel tests without adding the Basesystem/LTSS repository to INCIDENT_REPO. Only the livepatching repository was added. This happened on <a href="https://openqa.suse.de/tests/8085238#settings" class="external">SLE-12SP4</a>, <a href="https://openqa.suse.de/tests/8082278#settings" class="external">SLE-15SP2</a> (<a href="https://openqa.suse.de/tests/8081179#settings" class="external">twice</a>) and <a href="https://openqa.suse.de/tests/8087134#settings" class="external">SLE-15SP1</a>:</p>
<pre><code>INCIDENT_REPO=http://download.suse.de/ibs/SUSE:/Maintenance:/22660/SUSE_Updates_SLE-Module-Live-Patching_15-SP1_x86_64
</code></pre>
<p>Some of these tests have already been rescheduled with the correct settings but SLE-15SP1 is still affected. Current S:M:22660 incident data in QEM dashboard API:</p>
<pre><code>{"approved":false,"channels":["SUSE:SLE-15-SP1:Update","SUSE:Updates:SLE-Product-HA:15-SP1:x86_64","SUSE:Updates:SLE-Product-HA:15-SP1:s390x","SUSE:Updates:SLE-Product-HA:15-SP1:ppc64le","SUSE:Updates:SLE-Product-HA:15-SP1:aarch64","SUSE:Updates:Storage:6:aarch64","SUSE:Updates:Storage:6:x86_64","SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP3:x86_64","SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP3:s390x","SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP3:ppc64le","SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP3:aarch64","SUSE:Updates:SLE-Module-Live-Patching:15-SP1:x86_64","SUSE:Updates:SLE-Module-Live-Patching:15-SP1:ppc64le","SUSE:Updates:SUSE-CAASP:4.0:x86_64","SUSE:Updates:SLE-Product-SLES:15-SP1-BCL:x86_64","SUSE:Updates:SLE-Product-HPC:15-SP1-ESPOS:aarch64","SUSE:Updates:SLE-Product-HPC:15-SP1-ESPOS:x86_64","SUSE:Updates:SLE-Product-SLES_SAP:15-SP1:ppc64le","SUSE:Updates:SLE-Product-SLES_SAP:15-SP1:x86_64","SUSE:Updates:SLE-Product-SLES:15-SP1-LTSS:x86_64","SUSE:Updates:SLE-Product-SLES:15-SP1-LTSS:s390x","SUSE:Updates:SLE-Product-SLES:15-SP1-LTSS:ppc64le","SUSE:Updates:SLE-Product-SLES:15-SP1-LTSS:aarch64","SUSE:Updates:SLE-Product-HPC:15-SP1-LTSS:x86_64","SUSE:Updates:SLE-Product-HPC:15-SP1-LTSS:aarch64","SUSE:Updates:openSUSE-SLE:15.3","SUSE:Updates:openSUSE-SLE:15.4","SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP4:aarch64","SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP4:ppc64le","SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP4:s390x","SUSE:Updates:SLE-Module-Development-Tools-OBS:15-SP4:x86_64"],"emu":false,"inReview":false,"inReviewQAM":false,"isActive":true,"number":22660,"packages":["dtb-aarch64","kernel-debug","kernel-default","kernel-docs","kernel-kvmsmall","kernel-livepatch-SLE15-SP1_Update_28","kernel-obs-build","kernel-obs-qa","kernel-source","kernel-syms","kernel-vanilla","kernel-zfcpdump"],"project":"SUSE:Maintenance:22660","rr_number":null}
</code></pre>
openQA Project - action #99246 (Resolved): Published QCOW images appear to be uncompressed https://progress.opensuse.org/issues/99246 2021-09-24T11:41:40Z MDoucha martin.doucha@suse.com
<p>QCOW images generated by OpenQA install jobs appear to be uncompressed now.</p>
<p>ppc64le: <a href="https://openqa.suse.de/tests/7211627#downloads" class="external">https://openqa.suse.de/tests/7211627#downloads</a><br>
HDD_1: 885MB<br>
PUBLISH_HDD_1: 7.1GB</p>
<p>x86_64: <a href="https://openqa.suse.de/tests/7211569#downloads" class="external">https://openqa.suse.de/tests/7211569#downloads</a><br>
HDD_1: 976MB<br>
PUBLISH_HDD_1: 4.2GB</p>
<p>The size difference should be only a few hundred megabytes at most.</p>
openQA Project - action #98841 (Resolved): qemu randomly fails to start on QA-Power8-5-kvm auto_r... https://progress.opensuse.org/issues/98841 2021-09-17T15:49:43Z MDoucha martin.doucha@suse.com
<a name="Observation"></a>
<h2 >Observation<a href="#Observation" class="wiki-anchor">¶</a></h2>
<p>A few LTP jobs have failed to start today due to qemu error on QA-Power8-5-kvm worker:<br>
<a href="https://openqa.suse.de/tests/7138972">https://openqa.suse.de/tests/7138972</a><br>
<a href="https://openqa.suse.de/tests/7149857">https://openqa.suse.de/tests/7149857</a><br>
<a href="https://openqa.suse.de/tests/7153989">https://openqa.suse.de/tests/7153989</a></p>
<p>All of them have the following output in autoinst-log.txt:</p>
<pre><code>[2021-09-17T16:24:29.803 CEST] [info] ::: backend::baseclass::die_handler: Backend process died, backend errors are reported below in the following lines:
QEMU terminated before QMP connection could be established. Check for errors below
[2021-09-17T16:24:29.804 CEST] [info] ::: OpenQA::Qemu::Proc::save_state: Saving QEMU state to qemu_state.json
[2021-09-17T16:24:29.805 CEST] [debug] Passing remaining frames to the video encoder
[2021-09-17T16:24:29.805 CEST] [debug] Waiting for video encoder to finalize the video
[2021-09-17T16:24:29.805 CEST] [debug] The built-in video encoder (pid 110385) terminated
[2021-09-17T16:24:29.807 CEST] [debug] QEMU: QEMU emulator version 4.2.1 (openSUSE Leap 15.2)
[2021-09-17T16:24:29.807 CEST] [debug] QEMU: Copyright (c) 2003-2019 Fabrice Bellard and the QEMU Project developers
[2021-09-17T16:24:29.807 CEST] [warn] !!! : qemu-system-ppc64: Failed to allocate KVM HPT of order 25 (try smaller maxmem?): Cannot allocate memory
</code></pre>
<a name="Problem"></a>
<h2 >Problem<a href="#Problem" class="wiki-anchor">¶</a></h2>
<p>QA-Power8-5-kvm has 256GB RAM. <a href="https://monitor.qa.suse.de/d/WDQA-Power8-5-kvm/worker-dashboard-qa-power8-5-kvm?viewPanel=12054&orgId=1&from=1631765162464&to=1632085860553">https://monitor.qa.suse.de/d/WDQA-Power8-5-kvm/worker-dashboard-qa-power8-5-kvm?viewPanel=12054&orgId=1&from=1631765162464&to=1632085860553</a> shows that some memory was used during the period when the test failed but nothing that should explain the inability to allocate the memory for the qemu VM. In the system journal there is</p>
<pre><code>Sep 17 16:24:28 QA-Power8-5-kvm worker[88148]: [debug] [pid:88148] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7125263/status
Sep 17 16:24:29 QA-Power8-5-kvm worker[104911]: [info] [pid:108741] sle-15-SP4-ppc64le-Build36.1-HA-BV.qcow2: Processing chunk 501/5812, avg. speed ~976.562 KiB/s
Sep 17 16:24:29 QA-Power8-5-kvm worker[101413]: [debug] [pid:102598] Uploading artefact mq_timedreceive_15-1-2.txt
Sep 17 16:24:29 QA-Power8-5-kvm worker[96737]: [debug] [pid:96737] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7125265/status
Sep 17 16:24:29 QA-Power8-5-kvm worker[109458]: [debug] [pid:110336] Uploading artefact bootloader_start-15.txt
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: 23 callbacks suppressed
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf00, 3cb100) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf04, 3cb104) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf08, 3cb108) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf0c, 3cb10c) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf10, 3cb110) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf14, 3cb114) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf18, 3cb118) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf1c, 3cb11c) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf20, 3cb120) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf24, 3cb124) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -16
Sep 17 16:24:30 QA-Power8-5-kvm worker[88148]: [debug] [pid:88148] Upload concluded (at wait_children)
Sep 17 16:24:30 QA-Power8-5-kvm worker[109557]: [info] [pid:109557] Isotovideo exit status: 1
Sep 17 16:24:30 QA-Power8-5-kvm worker[109557]: [debug] [pid:109557] Stopping job 7153989 from openqa.suse.de: 07153989-sle-15-SP3-Server-DVD-Incidents-Kernel-KOTD-ppc64le-Build5.3.18-302.1.g316993b-ltp_crashme@ppc64le-virtio - reason: died
Sep 17 16:24:30 QA-Power8-5-kvm worker[109557]: [debug] [pid:109557] REST-API call: POST http://openqa.suse.de/api/v1/jobs/7153989/status
Sep 17 16:24:30 QA-Power8-5-kvm worker[101413]: [debug] [pid:102598] Uploading artefact mq_timedreceive_7-1-2.txt
</code></pre>
<p>in particular the messages</p>
<pre><code>Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf20, 3cb120) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: alloc_contig_range: [3caf24, 3cb124) PFNs busy
Sep 17 16:24:29 QA-Power8-5-kvm kernel: cma: cma_alloc: alloc failed, req-size: 512 pages, ret: -16
</code></pre>
<p>so an allocation failure. We could report a bug about this, but because KVM on SUSE with Power8 is unsupported, I don't expect any success.</p>
<p>We likely need to accept such issues and trigger a restart automatically by openQA.</p>
<a name="Acceptance-criteria"></a>
<h2 >Acceptance criteria<a href="#Acceptance-criteria" class="wiki-anchor">¶</a></h2>
<ul>
<li><strong>AC1:</strong> qemu ppc64le allocation errors cause automatic job retriggers by openQA</li>
</ul>
<a name="Suggestions"></a>
<h2 >Suggestions<a href="#Suggestions" class="wiki-anchor">¶</a></h2>
<ul>
<li>Catch the error and make it "Incomplete"</li>
<li>Restart the incomplete job</li>
<li>Make openQA automatically detect the issue and trigger restart, e.g. based on <a href="https://github.com/os-autoinst/openQA/blob/master/etc/openqa/openqa.ini#L76">https://github.com/os-autoinst/openQA/blob/master/etc/openqa/openqa.ini#L76</a></li>
</ul>
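<p>The detection step in the suggestions above could be as simple as a regular expression over the job's log or failure reason. The sketch below is an assumption about how such a rule might look, not the openQA implementation; the pattern is taken from the qemu warning quoted in the observation and the function name is made up:</p>

```python
import re

# Assumed pattern, based on the qemu-system-ppc64 warning in autoinst-log.txt.
RETRY_PATTERN = re.compile(r"Failed to allocate KVM HPT of order \d+")


def should_retrigger(log_text):
    """Return True if the log shows the known ppc64le allocation failure."""
    return RETRY_PATTERN.search(log_text) is not None


LOG = (
    "[warn] !!! : qemu-system-ppc64: Failed to allocate KVM HPT of order 25 "
    "(try smaller maxmem?): Cannot allocate memory"
)
print(should_retrigger(LOG))  # prints True
```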
openQA Project - action #80996 (Resolved): Implement generic serial terminal over SSH https://progress.opensuse.org/issues/80996 2020-12-11T15:37:05Z MDoucha martin.doucha@suse.com
<p>Quite surprisingly, os-autoinst still has no fully functional SSH serial terminal. It only has multiple VNC-over-SSH console types and one read-only serial terminal for Svirt.</p>
<p>Implement a generic SSH serial terminal that'll provide bidirectional interactive plaintext shell on any backend where the SUT is running SSH server accessible from os-autoinst.</p>
openQA Infrastructure - action #63706 (Rejected): [zkvm] Connection loss between VM and host on o... https://progress.opensuse.org/issues/63706 2020-02-21T10:13:48Z MDoucha martin.doucha@suse.com
<p>The zkvm slots on openqaworker2 frequently lose the VNC and/or SSH connection between the host and the VM. The first recent appearance of this problem was on 2020-02-19 around 1 AM, and it affects both SLE-15GA and SLE-15SP1. SLE-12* jobs use a different worker class.</p>
<p><a href="https://openqa.suse.de/tests/3898309#step/install_ltp/24" class="external">https://openqa.suse.de/tests/3898309#step/install_ltp/24</a><br>
<a href="https://openqa.suse.de/tests/3898794#step/install_ltp/30" class="external">https://openqa.suse.de/tests/3898794#step/install_ltp/30</a><br>
<a href="https://openqa.suse.de/tests/3906656#step/update_kernel/30" class="external">https://openqa.suse.de/tests/3906656#step/update_kernel/30</a><br>
<a href="https://openqa.suse.de/tests/3909115#step/install_ltp/64" class="external">https://openqa.suse.de/tests/3909115#step/install_ltp/64</a><br>
<a href="https://openqa.suse.de/tests/3898244#step/update_kernel/37" class="external">https://openqa.suse.de/tests/3898244#step/update_kernel/37</a><br>
<a href="https://openqa.suse.de/tests/3906591#step/install_ltp/12" class="external">https://openqa.suse.de/tests/3906591#step/install_ltp/12</a></p>
openQA Infrastructure - action #61994 (Resolved): VNC console corruption on aarch64 https://progress.opensuse.org/issues/61994 2020-01-10T09:46:39Z MDoucha martin.doucha@suse.com
<p>A random problem sometimes appears on aarch64 test machines where the VM screen isn't properly cleared after boot and console output gets drawn over remnants of boot splash screen. Then the job fails because needles don't match. The problem appears less than once a week and job restart usually fixes it but it might be worth investigating further.<br>
<a href="https://openqa.suse.de/tests/3773959#step/update_kernel/6" class="external">https://openqa.suse.de/tests/3773959#step/update_kernel/6</a></p>
openQA Infrastructure - action #61844 (Resolved): auto_review:"download failed: 521 - Connect tim... https://progress.opensuse.org/issues/61844 2020-01-07T14:21:57Z MDoucha martin.doucha@suse.com
<p>The cache service on openqaworker-arm-3 frequently fails to download assets with error 521:</p>
<pre><code>[2020-01-05T01:30:22.0405 CET] [info] [pid:49324] Downloading SLES-15-aarch64-minimal_installed_for_LTP.qcow2, request #3191 sent to Cache Service
[2020-01-05T01:30:48.0583 CET] [info] [pid:49324] Download of SLES-15-aarch64-minimal_installed_for_LTP.qcow2 processed:
[info] [#3191] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#3191] Downloading "SLES-15-aarch64-minimal_installed_for_LTP.qcow2" from "openqa.suse.de/tests/3754531/asset/hdd/SLES-15-aarch64-minimal_installed_for_LTP.qcow2"
[info] [#3191] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-aarch64-minimal_installed_for_LTP.qcow2" because the download failed: 521 - Connect timeout
</code></pre>
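<p>To count such failures across worker logs, the purge message can be matched line by line. This is a rough sketch; the regular expression is an assumption based only on the log line shown above, and the helper name is made up:</p>

```python
import re

# Assumed format, copied from the cache service purge message in the excerpt.
FAILURE = re.compile(
    r'Purging "(?P<path>[^"]+)" because the download failed: '
    r"(?P<code>\d+) - (?P<reason>.+)$"
)


def download_failures(log_text):
    """Return (status_code, reason, asset_name) for each failed download."""
    failures = []
    for line in log_text.splitlines():
        match = FAILURE.search(line)
        if match:
            asset = match["path"].rsplit("/", 1)[-1]
            failures.append((int(match["code"]), match["reason"], asset))
    return failures


SAMPLE = '''[info] [#3191] Cache size of "/var/lib/openqa/cache" is 49GiB, with limit 50GiB
[info] [#3191] Purging "/var/lib/openqa/cache/openqa.suse.de/SLES-15-aarch64-minimal_installed_for_LTP.qcow2" because the download failed: 521 - Connect timeout
'''
print(download_failures(SAMPLE))
```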
<p>The error may seem rare at first glance but that's most likely because of asset caching on workers. For example, of the last 10 jobs on openqaworker-arm-3:19 (at the time of writing), 2 jobs failed with connect timeout, 2 jobs downloaded at least one asset successfully and 6 jobs ran entirely from cache. It's not clear from logs whether the timeout happens during the initial connection or halfway through downloading a 2GB file.<br>
<a href="https://openqa.suse.de/admin/workers/1298" class="external">https://openqa.suse.de/admin/workers/1298</a></p>
<p>The oldest case confirmed by os-autoinst log is from 2019-12-15: <a href="https://openqa.suse.de/tests/3708066" class="external">https://openqa.suse.de/tests/3708066</a><br>
There may have been older cases but their logs have most likely been deleted by now.</p>
<p>I've also looked at 5 instances of openqaworker-arm-1 and found only 3 confirmed cases of the same error. That's low enough to be caused by chance.</p>
openQA Infrastructure - action #58945 (Resolved): OpenQA worker service not restarted after OpenQ... https://progress.opensuse.org/issues/58945 2019-10-31T13:12:21Z MDoucha martin.doucha@suse.com
<p>The openqa-worker service on some openqa.suse.de workers doesn't get restarted after an update. This may cause a version mismatch between the os-autoinst and openQA-common packages.</p>
<p>One example of this mismatch is the following set of three verification runs for <a href="https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8329" class="external">https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/8329</a>:<br>
openqaworker2: <a href="https://openqa.suse.de/tests/3541705" class="external">https://openqa.suse.de/tests/3541705</a> (openqa-worker service last restarted on 2019-10-30)<br>
openqaworker6: <a href="https://openqa.suse.de/tests/3541697" class="external">https://openqa.suse.de/tests/3541697</a> (openqa-worker service last restarted on 2019-09-18)<br>
openqaworker9: <a href="https://openqa.suse.de/tests/3544337" class="external">https://openqa.suse.de/tests/3544337</a> (openqa-worker service last restarted on 2019-09-18)</p>
<p>All three jobs ran the same test modules (see autoinst log) but all tests after install_ltp were scheduled at runtime. Updating the test schedule at runtime requires patches merged into OpenQA on 2019-09-27, so openqaworker6 and openqaworker9 didn't update the test schedule because they were still running openQA-common from mid-September, before the patches were merged.</p>
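<p>The date comparison behind this conclusion can be written down as a small check. It is a sketch under the assumption that the last service restart date is a good proxy for the code version a worker is running; the dates are the ones quoted above and the function name is made up:</p>

```python
from datetime import date

# Dates taken from the report above.
FEATURE_MERGED = date(2019, 9, 27)
LAST_RESTART = {
    "openqaworker2": date(2019, 10, 30),
    "openqaworker6": date(2019, 9, 18),
    "openqaworker9": date(2019, 9, 18),
}


def stale_workers(last_restart, required_since):
    """Return hosts whose worker service was last restarted before required_since."""
    return sorted(
        host for host, restarted in last_restart.items() if restarted < required_since
    )


print(stale_workers(LAST_RESTART, FEATURE_MERGED))
# prints ['openqaworker6', 'openqaworker9']
```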
openQA Infrastructure - action #58805 (Resolved): [infra]Severe storage performance issue on open... https://progress.opensuse.org/issues/58805 2019-10-29T11:34:09Z MDoucha martin.doucha@suse.com
<p>Last week on Thursday, a handful of tests in two LTP testsuites started timing out. I initially reported it as a kernel performance regression: <a href="https://bugzilla.suse.com/show_bug.cgi?id=1155018" class="external">https://bugzilla.suse.com/show_bug.cgi?id=1155018</a></p>
<p>However, I've tried to reproduce the problem on a released kernel version which didn't have the issue 3 weeks ago and succeeded: <a href="https://openqa.suse.de/tests/overview?build=15ga_mdoucha_bsc_1155018&version=15&distri=sle" class="external">https://openqa.suse.de/tests/overview?build=15ga_mdoucha_bsc_1155018&version=15&distri=sle</a></p>
<p>This successful reproduction on a known good kernel indicates that the problem is somewhere in OpenQA infrastructure, possibly a bug introduced during the weekly deployment on Wednesday, October 23rd. The timeout continues to appear in kernel-of-the-day LTP tests: <a href="https://openqa.suse.de/tests/3533819#step/DOR000/7" class="external">https://openqa.suse.de/tests/3533819#step/DOR000/7</a></p>
<p>Both PPC64LE and x86_64 are affected. Reproducibility on aarch64 and s390 is currently unknown because we don't run the affected testsuites on those two platforms. The failing tests mostly belong to the async & direct I/O stress testsuite.</p>