action #76774
openQA Project (public) - coordination #102915: [saga][epic] Automated classification of failures (closed)
openQA Project (public) - coordination #166655: [epic] openqa-label-known-issues
auto-review/openqa-label-known-issues does not conclude within GitLab CI
Description
Observation
The issue has persisted for 2 days now and re-triggering didn't help: https://gitlab.suse.de/openqa/auto-review/-/pipelines
I've attached an example log file.
In particular, the jobs "monitor and label incomplete jobs - dry run label" (1) and "monitor and label incomplete jobs - osd" (2) fail. I assume (1) is about o3 and (2) is about OSD.
Acceptance criteria
- AC1: auto-review gitlab CI pipeline jobs can finish in time, i.e. they do not run into a timeout
Suggestions
- Investigate the CI jobs
Additional notes
- I've already tried to reproduce the issue locally. Neither yesterday nor today did openqa-monitor-incompletes find a big number of incomplete jobs; this applies to both o3 and OSD, so the problem is not just caused by too many incomplete jobs.
- Likely some openqa-client call hangs (see the sketch right below these notes).
- I wasn't able to conduct a non-dry run locally where it would actually label some jobs. It is unclear how to reproduce the issue outside of the CI and whether the issue is CI-specific or not.
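As a starting point for pinning down a hanging call locally, here is a minimal sketch (a suggestion, not something the pipeline currently does): wrap the same kind of read-only client call the script performs in coreutils timeout so a stuck request fails fast. The job id is just the example that shows up in the logs further down and 60s is an arbitrary limit:
timeout 60 openqa-client --json-output --host https://openqa.suse.de jobs/4899635 ||
  echo "openqa-client call did not return within 60s"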
Updated by okurz about 4 years ago
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Normal to Urgent
- Target version set to Ready
In particular, the jobs "monitor and label incomplete jobs - dry run label" (1) and "monitor and label incomplete jobs - osd" (2) fail. I assume (1) is about o3 and (2) is about OSD.
Hm, if it had been only about osd I would have suspected the issue to be related to #73633, but as o3 also seems to be involved I wonder if the client being slow to react is linked to an update of dependencies that we might have received.
Actually no, "monitor and label incomplete jobs - dry run label" is also about osd but just "dry run", i.e. there should not be any writing call over the openQA API. In both the pipeline from yesterday and today I found that both the "dry run" and actual run for osd failed but passed for o3. So it is strange that the dry run already fails but as this is OSD specific this might still be about #73633
Still, it is not like all openqa-client calls would fail; e.g. https://gitlab.suse.de/openqa/auto-review/-/jobs/277707#L132 shows
++ openqa-client --json-output --host https://openqa.suse.de jobs/4899635
which is an actual call to a real-life job on osd and that seems to work fine.
Comparing the "last good" dry run log from https://gitlab.suse.de/openqa/auto-review/-/jobs/277063 with the "first bad" one from https://gitlab.suse.de/openqa/auto-review/-/jobs/277707, in the "last good" case I see for example:
+ id=4892526
+ search_term='(?s)([dD]ownload.*failed.*404).*Result: setup failure'
+ comment='label:non_existing asset, candidate for removal or wrong settings'
+ restart=
+ grep_opts=-qPzo
+ check=
+ [[ -n '' ]]
+ grep -qPzo '(?s)([dD]ownload.*failed.*404).*Result: setup failure' /tmp/tmp.4UvfFy39X0
+ echo openqa-client --host https://openqa.suse.de jobs/4892526/comments post 'text=label:non_existing asset, candidate for removal or wrong settings'
+ '[' '' = 1 ']'
+ return
+ return
+ for i in $(cat - | sed 's/ .*$//')
+ investigate_issue https://openqa.suse.de/tests/4889562
+ local id=4889562
+ local out
+ local reason
+ local curl
++ openqa-client --json-output --host https://openqa.suse.de jobs/4889562
++ jq -r .job.reason
+ reason='backend died: qemu-img: Could not open '\''sle-15-SP3-aarch64-68.2@aarch64-virtio-minimal_with_sdk68.2_installed-uefi-vars.qcow2'\'': Could not open '\''sle-15-SP3-aarch64-68.2@aarch64-virtio-minimal_with_sdk68.2_installed-uefi-vars.qcow2'\'': No such file or directory'
+ '[' -n 'backend died: qemu-img: Could not open '\''sle-15-SP3-aarch64-68.2@aarch64-virtio-minimal_with_sdk68.2_installed-uefi-vars.qcow2'\'': Could not open '\''sle-15-SP3-aarch64-68.2@aarch64-virtio-minimal_with_sdk68.2_installed-uefi-vars.qcow2'\'': No such file or directory' ']'
+ '[' 257 -lt 8 ']'
and in the "first bad" the corresponding log excerpt is:
+ id=4906221
+ search_term='(?s)([dD]ownload.*failed.*404).*Result: setup failure'
+ comment='label:non_existing asset, candidate for removal or wrong settings'
+ restart=
+ grep_opts=-qPzo
+ check=
+ [[ -n '' ]]
+ grep -qPzo '(?s)([dD]ownload.*failed.*404).*Result: setup failure' /tmp/tmp.4EuvKRwfVh
...............
So after the grep there should be "echo openqa-client --host https://openqa.s…" but there is nothing.
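For context, judging purely from these traces and not from the actual script source, the per-issue handling presumably looks roughly like the following sketch (function and variable names are partly assumptions): in dry-run mode the client call is only echoed after the grep matched, so a trace that ends right after the grep line means the grep never finished or the job was killed before reaching the echo.
# reconstructed from the traces above, not copied from openqa-label-known-issues
label_issue() {
    local id=$1 search_term=$2 comment=$3 out=$4 host=$5
    # bail out if the job log does not match the known-issue pattern
    grep -qPzo "$search_term" "$out" || return
    # dry run: only print the openqa-client call instead of executing it
    echo openqa-client --host "$host" jobs/"$id"/comments post "text=$comment"
}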
Updated by mkittler about 4 years ago
- Priority changed from Urgent to High
It seems to work again. So maybe it is really just caused by OSD being unresponsive. It would make sense to have a better timeout in the client then.
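Until the client itself gets a better timeout, one possible workaround on the script side, sketched here and not what the script currently does, would be to wrap the read-only queries in coreutils timeout so a single unresponsive request aborts instead of eating the whole CI time budget; fetch_job is a hypothetical helper name:
# hypothetical helper, not current script code
fetch_job() {
    local host=$1 id=$2
    timeout 120 openqa-client --json-output --host "$host" jobs/"$id" ||
        { echo "query for job $id on $host timed out or failed" >&2; return 1; }
}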
Actually no, "monitor and label incomplete jobs - dry run label" is also about osd but just "dry run"
I assumed it was about o3 because there are no environment variables set in "monitor and label incomplete jobs - dry run label" and the default host is o3. But apparently the default is overridden by TARGET in .gitlab-ci.yml (wherever TARGET comes from). Maybe the names should be clearer because this kind of reverse engineering is error-prone and annoying.
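For illustration, the effect boils down to something like the following sketch; TARGET is the variable from .gitlab-ci.yml, while the default URL and the exact way the scripts consume the variable are assumptions on my side:
# sketch only; the real variable handling in the scripts may differ
host_url="${TARGET:-https://openqa.opensuse.org}"
openqa-client --json-output --host "$host_url" jobs/4899635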
Updated by okurz about 4 years ago
- Related to action #77101: fix selection of gitlab CI runners added
Updated by okurz about 4 years ago
- Status changed from Workable to Resolved
- Assignee set to okurz
I assume this was somewhat related to #77101 and changes within the gitlab CI runner infrastructure by EngInfra. I have not observed that problem for some days now; auto-review seems to be ok regarding runtimes.