action #64917
coordination #39719 (closed): [saga][epic] Detection of "known failures" for stable tests, easy test results review and easy tracking of known issues
coordination #62420: [epic] Distinguish all types of incompletes
auto_review:"(?s)qemu-img.*runcmd.*failed with exit code 1" sometimes but no apparent error message
Description
Observation
https://openqa.opensuse.org/tests/1214063 is incomplete with
[2020-03-26T04:54:21.454 CET] [debug] running /usr/bin/qemu-img create -f qcow2 /var/lib/openqa/pool/17/raid/hd0 20G
[2020-03-26T04:54:21.478 CET] [debug] Formatting '/var/lib/openqa/pool/17/raid/hd0', fmt=qcow2 size=21474836480 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2020-03-26T04:54:21.478 CET] [debug] Backend process died, backend errors are reported below in the following lines:
runcmd failed with exit code 1 at /usr/lib/os-autoinst/osutils.pm line 121.
as reported by https://gitlab.suse.de/openqa/auto-review/-/jobs/186291
The system journal on the assigned worker machine openqaworker4 did not provide any further information. It seems the command qemu-img create … produced the "Formatting …" output but then failed silently with exit code 1 and no apparent error message. When executing a similar command manually it works fine, and if one provokes an error, e.g. with incorrect parameters, incorrect permissions, etc., useful error messages are shown.
Acceptance criteria
- AC1: In case a failure of the qemu-img command cannot be prevented, a helpful error message is shown
Suggestions
- Maybe we can just retry the qemu-img command a few times (see the sketch below)
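For illustration, a minimal sketch of such a retry in Perl; the helper name, the attempt count and the sleep between attempts are assumptions for this sketch, not the actual os-autoinst implementation:
use strict;
use warnings;

# Minimal sketch only: retry the command a few times before giving up.
sub runcmd_with_retries {
    my (@cmd) = @_;
    my $attempts = 3;
    for my $try (1 .. $attempts) {
        return 0 if system(@cmd) == 0;    # success
        my $exit = $? >> 8;
        warn "'@cmd' failed with exit code $exit (attempt $try of $attempts)\n";
        sleep 1 if $try < $attempts;
    }
    die "'@cmd' still failing after $attempts attempts\n";
}

# example invocation similar to the failing call from the log
runcmd_with_retries('/usr/bin/qemu-img', 'create', '-f', 'qcow2', '/tmp/hd0', '20G');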
Updated by okurz over 4 years ago
- Copied from action #64854: qemu-img error message is incorrectly tried to be parsed as JSON auto_review:"malformed JSON string" added
Updated by okurz over 4 years ago
- Copied from deleted (action #64854: qemu-img error message is incorrectly tried to be parsed as JSON auto_review:"malformed JSON string")
Updated by okurz over 4 years ago
- Priority changed from Normal to High
I have the feeling this is a problem that only started showing up recently, e.g. again in https://openqa.opensuse.org/tests/1218912
[2020-03-30T13:59:52.476 CEST] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/7/openSUSE-13.1-gnome64.qcow2 /var/lib/openqa/pool/7/raid/hd0-overlay0 42949672960
[2020-03-30T13:59:52.497 CEST] [debug] Formatting '/var/lib/openqa/pool/7/raid/hd0-overlay0', fmt=qcow2 size=42949672960 backing_file=/var/lib/openqa/pool/7/openSUSE-13.1-gnome64.qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2020-03-30T13:59:52.497 CEST] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/7/openSUSE-Tumbleweed-NET-x86_64-Snapshot20200329-Media.iso /var/lib/openqa/pool/7/raid/cd0-overlay0 143654912
[2020-03-30T13:59:52.520 CEST] [debug] Formatting '/var/lib/openqa/pool/7/raid/cd0-overlay0', fmt=qcow2 size=143654912 backing_file=/var/lib/openqa/pool/7/openSUSE-Tumbleweed-NET-x86_64-Snapshot20200329-Media.iso cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2020-03-30T13:59:52.521 CEST] [debug] Backend process died, backend errors are reported below in the following lines:
runcmd failed with exit code 1 at /usr/lib/os-autoinst/osutils.pm line 121.
Updated by okurz over 4 years ago
- Subject changed from qemu-img create sometimes fails with exit code 1 but no apparent error message to auto_review:"(?s)qemu-img.*runcmd.*failed with exit code 1" sometimes but no apparent error message
- Status changed from Workable to Feedback
- Assignee set to okurz
I have one (related) idea for improvement: the error message could contain the command that failed. It is best practice for log messages to be self-contained, so we should not rely on log messages that appeared earlier.
https://github.com/os-autoinst/os-autoinst/pull/1391
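As a simplified sketch of that idea in Perl (illustration only, not the actual osutils.pm implementation, which handles output differently), the die message would simply interpolate the command:
use strict;
use warnings;

# Sketch: include the full command in the die message so the error is
# understandable on its own, without the preceding log lines.
sub runcmd {
    my (@cmd) = @_;
    print STDERR "running @cmd\n";
    my $ret = system(@cmd);
    die "runcmd '@cmd' failed with exit code " . ($ret >> 8) . "\n" if $ret != 0;
    return 0;
}

runcmd('/usr/bin/qemu-img', 'create', '-f', 'qcow2', '/tmp/hd0', '20G');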
I cross-checked whether an error message from osutils would actually be shown. At least when running isotovideo -d
locally in a shell with a simulated error condition I could clearly see an error message, e.g. I did
sudo chown root.root raid/
sudo chmod o-w raid/
isotovideo -d
which showed:
[2020-04-30T21:10:23.781 CEST] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/share/factory/hdd/opensuse-Tumbleweed-x86_64-textmode@64bit.qcow2 …/raid/hd0-overlay0 0
[2020-04-30T21:10:23.796 CEST] [debug] qemu-img: …/raid/hd0-overlay0: Could not create file: Permission denied
Formatting '…/raid/hd0-overlay0', fmt=qcow2 size=0 backing_file=/var/lib/openqa/share/factory/hdd/opensuse-Tumbleweed-x86_64-textmode@64bit.qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2020-04-30T21:10:23.796 CEST] [debug] Backend process died, backend errors are reported below in the following lines:
runcmd '/usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/share/factory/hdd/opensuse-Tumbleweed-x86_64-textmode@64bit.qcow2 …/raid/hd0-overlay0 0' failed with exit code 1 at ../../osutils.pm line 122.
so the error message is shown. In the case of this ticket I could imagine that either the stderr output is lost within the openQA context or qemu-img actually fails silently with no output at all.
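To rule out lost output one could capture the command's combined stdout/stderr and attach it to the error message; a rough sketch of that idea in Perl (illustration only, not what osutils.pm actually does):
use strict;
use warnings;

# Capture the combined stdout/stderr of the command so it can be included in
# the error message instead of getting lost.
sub runcmd_capture {
    my ($cmd) = @_;                  # a single shell string, for simplicity
    my $output = qx{$cmd 2>&1};      # merge stderr into stdout
    chomp $output;
    my $exit = $? >> 8;
    die "runcmd '$cmd' failed with exit code $exit: '$output'\n" if $exit != 0;
    return $output;
}

print runcmd_capture('/usr/bin/qemu-img create -f qcow2 /tmp/hd0-overlay0 1G'), "\n";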
Updated by okurz over 4 years ago
- File journal_poo64917.log added
- Status changed from Feedback to In Progress
The PR was merged and is deployed on o3. https://openqa.opensuse.org/tests/1251581 is one candidate that shows the same error with the new output:
[2020-05-02T08:13:47.143 CEST] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/13/opensuse-15.2-aarch64-159.1-textmode@aarch64.qcow2 /var/lib/openqa/pool/13/raid/hd0-overlay0 32212254720
[2020-05-02T08:14:03.797 CEST] [debug] Formatting '/var/lib/openqa/pool/13/raid/hd0-overlay0', fmt=qcow2 size=32212254720 backing_file=/var/lib/openqa/pool/13/opensuse-15.2-aarch64-159.1-textmode@aarch64.qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2020-05-02T08:14:04.774 CEST] [debug] Backend process died, backend errors are reported below in the following lines:
runcmd '/usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/13/opensuse-15.2-aarch64-159.1-textmode@aarch64.qcow2 /var/lib/openqa/pool/13/raid/hd0-overlay0 32212254720' failed with exit code 1 at /usr/lib/os-autoinst/osutils.pm line 121.
I have attached the journal from the worker machine for this time period; it mentions the asset "opensuse-15.2-aarch64-159.1-textmode@aarch64.qcow2" multiple times.
Before the test starts I see:
May 02 08:11:32 openqa-aarch64 worker[2815]: [info] Download of opensuse-15.2-aarch64-159.1-textmode@aarch64.qcow2 processed:
May 02 08:11:32 openqa-aarch64 worker[2815]: [info] [#2713] Cache size of "/var/lib/openqa/cache" is 87GiB, with limit 200GiB
May 02 08:11:32 openqa-aarch64 worker[2815]: [info] [#2713] Downloading "opensuse-15.2-aarch64-159.1-textmode@aarch64.qcow2" from "http://openqa1-opensuse/tests/1251578/asset/hdd/opensuse-15.2-aarch64-159.1-textmode@aarch64.qcow2"
May 02 08:11:32 openqa-aarch64 worker[2815]: [info] [#2713] Content of "/var/lib/openqa/cache/openqa1-opensuse/opensuse-15.2-aarch64-159.1-textmode@aarch64.qcow2" has not changed, updating last use
so the file had already been downloaded before; only its last-use timestamp is updated within the cache DB. Also, the cache has more than 100 GiB free, so there is no risk of exceeding the needed space.
After that there is
May 02 08:11:48 openqa-aarch64 worker[2823]: [info] 4028: WORKING 1251581
May 02 08:11:48 openqa-aarch64 worker[2817]: [info] 4027: WORKING 1251586
May 02 08:11:48 openqa-aarch64 worker[2819]: [info] 4031: WORKING 1251579
May 02 08:11:48 openqa-aarch64 worker[2815]: [info] Using cgroup /sys/fs/cgroup/systemd/openqa.slice/openqa-worker.slice/openqa-worker@6.service/1251578
May 02 08:11:48 openqa-aarch64 worker[2815]: [info] Starting isotovideo container
May 02 08:11:49 openqa-aarch64 worker[2815]: [info] isotovideo has been started (PID: 4037)
May 02 08:11:50 openqa-aarch64 worker[2815]: [info] 4037: WORKING 1251578
May 02 08:11:53 openqa-aarch64 worker[2818]: [info] 4023: WORKING 1251588
May 02 08:12:41 openqa-aarch64 nscd[1408]: 1408 monitored file `/etc/resolv.conf` changed (mtime)
May 02 08:12:41 openqa-aarch64 nscd[1408]: 1408 monitoring file `/etc/resolv.conf` (7)
May 02 08:12:41 openqa-aarch64 nscd[1408]: 1408 monitoring directory `/etc` (2)
May 02 08:14:55 openqa-aarch64 worker[2823]: [info] Isotovideo exit status: 1
May 02 08:14:57 openqa-aarch64 worker[2823]: [info] +++ worker notes +++
May 02 08:14:57 openqa-aarch64 worker[2823]: [info] End time: 2020-05-02 06:14:57
That is the current job 1251581 starting to work, then other jobs and irrelevant entries, and finally the same worker (PID 2823) reporting that the job completed with a failure. So there is no other output at all, no activity within the cache at all, etc.
Other ideas I have: run the qemu-img command with multiple retries, or enable debugging and potentially parse the output file in case qemu-img fails. But I still suspect that we are losing output somewhere … Trying to add something with local tests.
Updated by okurz over 4 years ago
- Status changed from In Progress to Feedback
Updated by okurz over 4 years ago
The change was merged and deployed.
Using the SQL query
select id,test from jobs where result='incomplete' and reason ~ 'died' and t_finished > '2020-05-03 17:00' order by id asc;
I found one relevant job https://openqa.opensuse.org/tests/1259511 with
[2020-05-08T10:24:20.043 CEST] [debug] running /usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/8/GNOME_Next.x86_64-3.36.2-Build13.208.iso /var/lib/openqa/pool/8/raid/usbstick-overlay0 1148911616
[2020-05-08T10:24:20.065 CEST] [debug] Formatting '/var/lib/openqa/pool/8/raid/usbstick-overlay0', fmt=qcow2 size=1148911616 backing_file=/var/lib/openqa/pool/8/GNOME_Next.x86_64-3.36.2-Build13.208.iso cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2020-05-08T10:24:20.065 CEST] [debug] Backend process died, backend errors are reported below in the following lines:
runcmd '/usr/bin/qemu-img create -f qcow2 -b /var/lib/openqa/pool/8/GNOME_Next.x86_64-3.36.2-Build13.208.iso /var/lib/openqa/pool/8/raid/usbstick-overlay0 1148911616' failed with exit code 1: 'Formatting '/var/lib/openqa/pool/8/raid/usbstick-overlay0', fmt=qcow2 size=1148911616 backing_file=/var/lib/openqa/pool/8/GNOME_Next.x86_64-3.36.2-Build13.208.iso cluster_size=65536 lazy_refcounts=off refcount_bits=16' at /usr/lib/os-autoinst/osutils.pm line 121.
The contained message is actually not an error message but the progress output, hm. I checked the logs on the worker as before and found that the asset should still have been present at this time; an explicit prune message only appeared some time later. So I do not have a better idea than simply retrying multiple times: https://github.com/os-autoinst/os-autoinst/pull/1399
Updated by okurz over 4 years ago
- Due date set to 2020-05-25
I haven't found any jobs failing in the same step. So maybe the incompletes are averted, but now we struggle to find a case where the retry is actually applied.
Looking for references in test results on o3:
/var/lib/openqa/testresults> ionice -c 3 find 0127* -type f -name autoinst-log.txt -exec grep 'init_blockdev_images.* failed' '{}' \;
EDIT: Found none. I will just check again at a later time whether there are any incompletes. If we do not find any I will assume that the retries prevent the problem.
Updated by okurz over 4 years ago
- Due date changed from 2020-05-25 to 2020-06-09
- Priority changed from High to Low
With #64857 in place we can now more simply query the DB directly, e.g.
select jobs.id, t_started, workers.host as worker_host, workers.instance as worker_instance, reason from jobs join workers on assigned_worker_id=workers.id where result = 'incomplete' and reason ~ 'died.*qemu-img';
this currently yields a single job:
https://openqa.opensuse.org/tests/1277401
which does not look like it applies the retries. For some reason it seems we do not hit this problem as often anymore. Let's give it some more days before checking again.
Updated by okurz over 4 years ago
- Status changed from Feedback to Resolved
I checked the o3 database again and have not found this problem anymore since 2020-05-25.
Updated by okurz over 3 years ago
- Related to action #81828: Jobs run into timeout_exceeded after the 'chattr' call, no output until timeout, auto_review:"(?s)Refusing to save an empty state file to avoid overwriting a useful one.*Result: timeout":retry added