action #76774
openQA Project - coordination #102915: [saga][epic] Automated classification of failures (closed)
openQA Project - coordination #166655: [epic] openqa-label-known-issues
auto-review/openqa-label-known-issues does not conclude within GitLab CI
Description
Observation
The issue has persisted for 2 days now and re-triggering didn't help: https://gitlab.suse.de/openqa/auto-review/-/pipelines
I've attached an example log file.
In particular, the jobs "monitor and label incomplete jobs - dry run label" (1) and "monitor and label incomplete jobs - osd" (2) fail. I assume (1) is about o3 and (2) is about OSD.
Acceptance criteria
- AC1: auto-review GitLab CI pipeline jobs finish in time, i.e. do not run into the job timeout
Suggestions
- Investigate the CI jobs
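One way to narrow down where a job stalls would be to timestamp every line of the trace output. This is only a sketch: the exact way the two scripts are chained in the pipeline is an assumption, and ts(1) from moreutils would need to be available in the CI image.

```sh
# Sketch: add a wall-clock timestamp to every trace line so the point where
# the job stalls is visible in the CI log. How the two scripts are chained
# here is an assumption; only the ts(1) wrapping is the point.
set -o pipefail
./openqa-monitor-incompletes | ./openqa-label-known-issues 2>&1 | ts '%H:%M:%S'
```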
Additional notes
- I've already tried to reproduce the issue locally. Yesterday and today openqa-monitor-incompletes did not find a big number of incomplete jobs, neither on o3 nor on OSD, so the problem is not just caused by too many incomplete jobs.
- Likely some openqa-client call hangs.
- I wasn't able to conduct a non-dry run locally where it would actually label some jobs. It is unclear how to reproduce the issue outside of the CI and whether the issue is CI-specific or not.
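For reference, a local reproduction attempt could put a hard deadline around the whole dry run so a hang turns into a distinct exit code instead of an open-ended wait. The dry_run variable and the way the two scripts are chained are assumptions about the setup, not something verified in this ticket.

```sh
# Hypothetical local dry run with a hard deadline: a hang shows up as exit
# code 124 from timeout(1) instead of an endless wait. dry_run and the
# chaining of the two scripts are assumptions, not verified here.
export dry_run=1
timeout 10m sh -c './openqa-monitor-incompletes | ./openqa-label-known-issues'
case $? in
  0)   echo "finished in time" ;;
  124) echo "hit the 10 minute deadline, something hangs" >&2 ;;
  *)   echo "failed for another reason, see output above" >&2 ;;
esac
```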
Files
Updated by okurz about 4 years ago
- Description updated (diff)
- Status changed from New to Workable
- Priority changed from Normal to Urgent
- Target version set to Ready
In particular, the jobs "monitor and label incomplete jobs - dry run label" (1) and "monitor and label incomplete jobs - osd" (2) fail. I assume (1) is about o3 and (2) is about OSD.
Hm, if it had been only about OSD I would have suspected the issue to be related to #73633, but as o3 seems to be involved as well, I wonder if the client being slow to react is linked to an update of dependencies that we might have received.
Actually no, "monitor and label incomplete jobs - dry run label" is also about OSD, just as a "dry run", i.e. there should not be any writing calls over the openQA API. In both yesterday's and today's pipelines I found that both the "dry run" and the actual run for OSD failed but passed for o3. So it is strange that already the dry run fails, but as this is OSD-specific it might still be about #73633.
Still, it is not like all openqa-client calls would fail, e.g. https://gitlab.suse.de/openqa/auto-review/-/jobs/277707#L132 shows
++ openqa-client --json-output --host https://openqa.suse.de jobs/4899635
which is an actual call to a real-life job on OSD and that seems to work fine.
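A single successful call does not rule out intermittent slowness, though, so it might be worth timing the same request a few times; this is just a quick local check, not part of the CI job itself.

```sh
# Time the same request a few times to see whether the OSD API is
# intermittently slow; the job id is the one from the CI log above.
for i in 1 2 3; do
  /usr/bin/time -f "attempt $i: %e s" \
    openqa-client --json-output --host https://openqa.suse.de jobs/4899635 > /dev/null
done
```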
Comparing the "last good" dry run log from https://gitlab.suse.de/openqa/auto-review/-/jobs/277063 with the "first bad" one, https://gitlab.suse.de/openqa/auto-review/-/jobs/277707: in the "last good" case I see for example:
+ id=4892526
+ search_term='(?s)([dD]ownload.*failed.*404).*Result: setup failure'
+ comment='label:non_existing asset, candidate for removal or wrong settings'
+ restart=
+ grep_opts=-qPzo
+ check=
+ [[ -n '' ]]
+ grep -qPzo '(?s)([dD]ownload.*failed.*404).*Result: setup failure' /tmp/tmp.4UvfFy39X0
+ echo openqa-client --host https://openqa.suse.de jobs/4892526/comments post 'text=label:non_existing asset, candidate for removal or wrong settings'
+ '[' '' = 1 ']'
+ return
+ return
+ for i in $(cat - | sed 's/ .*$//')
+ investigate_issue https://openqa.suse.de/tests/4889562
+ local id=4889562
+ local out
+ local reason
+ local curl
++ openqa-client --json-output --host https://openqa.suse.de jobs/4889562
++ jq -r .job.reason
+ reason='backend died: qemu-img: Could not open '\''sle-15-SP3-aarch64-68.2@aarch64-virtio-minimal_with_sdk68.2_installed-uefi-vars.qcow2'\'': Could not open '\''sle-15-SP3-aarch64-68.2@aarch64-virtio-minimal_with_sdk68.2_installed-uefi-vars.qcow2'\'': No such file or directory'
+ '[' -n 'backend died: qemu-img: Could not open '\''sle-15-SP3-aarch64-68.2@aarch64-virtio-minimal_with_sdk68.2_installed-uefi-vars.qcow2'\'': Could not open '\''sle-15-SP3-aarch64-68.2@aarch64-virtio-minimal_with_sdk68.2_installed-uefi-vars.qcow2'\'': No such file or directory' ']'
+ '[' 257 -lt 8 ']'
and in the "first bad" the corresponding log excerpt is:
+ id=4906221
+ search_term='(?s)([dD]ownload.*failed.*404).*Result: setup failure'
+ comment='label:non_existing asset, candidate for removal or wrong settings'
+ restart=
+ grep_opts=-qPzo
+ check=
+ [[ -n '' ]]
+ grep -qPzo '(?s)([dD]ownload.*failed.*404).*Result: setup failure' /tmp/tmp.4EuvKRwfVh
...............
So after the grep there should be echo openqa-client --host https://openqa.s… but there is nothing.
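To make the expectation explicit, here is a minimal sketch of the flow as reconstructed from the trace above (not the verbatim script): when the search term matches the downloaded job log, the dry run is supposed to echo the labelling command rather than execute it, which is exactly the line that is missing in the "first bad" case.

```sh
# Minimal sketch reconstructed from the trace, not the verbatim script: on a
# match the dry run echoes the labelling command instead of posting it.
label_incomplete() {
  local id=$1 search_term=$2 comment=$3 out=$4   # out: downloaded job log
  local host=${host:-https://openqa.suse.de}
  grep -qPzo "$search_term" "$out" || return 0
  if [ "${dry_run:-0}" = 1 ]; then
    echo openqa-client --host "$host" "jobs/$id/comments" post "text=$comment"
  else
    openqa-client --host "$host" "jobs/$id/comments" post "text=$comment"
  fi
}
```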
Updated by mkittler about 4 years ago
- Priority changed from Urgent to High
It seems to work again. So maybe it is really just caused by OSD being unresponsive. It would make sense to have a better timeout in the client then.
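Until the client itself gets better timeout handling, an external per-call deadline would at least keep one unresponsive request from stalling the whole job; a sketch with an arbitrary 120 s limit:

```sh
# Sketch: an external deadline per call as a stop-gap until the client has
# proper timeout handling; 120 s is an arbitrary example value.
openqa_client_with_deadline() {
  timeout 120 openqa-client "$@"
}
openqa_client_with_deadline --json-output --host https://openqa.suse.de jobs/4899635
```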
Actually no, "monitor and label incomplete jobs - dry run label" is also about osd but just "dry run"
I assumed it was about o3 because there are no environment variables set in "monitor and label incomplete jobs - dry run label" and the default host is o3. But apparently the default is overridden by TARGET in .gitlab-ci.yml (wherever TARGET comes from). Maybe the names should be clearer because this reverse engineering is error-prone and annoying.
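One small improvement along those lines would be to print the effective host at the start of each job, so nobody has to reverse engineer it from .gitlab-ci.yml. TARGET as the variable name follows the comment above; the o3 default is an assumption.

```sh
# Sketch: log the effective target host at the start of the job so it is
# obvious whether a run is about o3 or OSD. The o3 default is an assumption.
host=${TARGET:-https://openqa.opensuse.org}
echo "labelling incomplete jobs on $host (dry_run=${dry_run:-0})"
```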
Updated by okurz about 4 years ago
- Related to action #77101: fix selection of gitlab CI runners added
Updated by okurz about 4 years ago
- Status changed from Workable to Resolved
- Assignee set to okurz
I assume this was somewhat related to #77101 and changes within the GitLab CI runner infrastructure by EngInfra. I have not observed the problem for some days now; auto-review seems to be ok regarding runtimes.