action #76774

closed

openQA Project (public) - coordination #102915: [saga][epic] Automated classification of failures

openQA Project (public) - coordination #166655: [epic] openqa-label-known-issues

auto-review/openqa-label-known-issues does not conclude within GitLab CI

Added by mkittler about 4 years ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee: okurz
Category: -
Start date: 2020-10-29
Due date:
% Done: 0%
Estimated time:
Description

Observation

The issue has persisted for 2 days now and re-triggering didn't help: https://gitlab.suse.de/openqa/auto-review/-/pipelines
I've attached an example log file.

In particular, the jobs "monitor and label incomplete jobs - dry run label" (1) and "monitor and label incomplete jobs - osd" (2) fail. I assume (1) is about o3 and (2) is about OSD.

Acceptance criteria

  • AC1: auto-review GitLab CI pipeline jobs finish in time, i.e. they do not run into the job timeout

Suggestions

  • Investigate the CI jobs

Additional notes

  • I've already tried to reproduce the issue locally. Yesterday and today openqa-monitor-incompletes did not find a large number of incomplete jobs; that applies to both o3 and OSD. So the problem is not simply caused by too many incomplete jobs.
  • Likely some openqa-client call hangs (see the probe sketch after this list).
  • I wasn't able to conduct a non-dry run locally where it would actually label some jobs. It is unclear how to reproduce the issue outside of the CI and whether the issue is CI-specific at all.
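
A minimal probe for a hanging client call, assuming timeout from coreutils; the 30-second limit is arbitrary and job 4899635 is just an example id taken from the CI log quoted in the comments:

timeout 30 openqa-client --json-output --host https://openqa.suse.de jobs/4899635 > /dev/null
echo "exit code: $?"   # 124 would mean the call was killed by the timeout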

Files

job.log (299 KB), mkittler, 2020-10-29 11:23

Related issues: 1 (0 open, 1 closed)

Related to openQA Infrastructure (public) - action #77101: fix selection of gitlab CI runners (Resolved, okurz, 2020-11-07)

Actions #1

Updated by okurz about 4 years ago

  • Description updated (diff)
  • Status changed from New to Workable
  • Priority changed from Normal to Urgent
  • Target version set to Ready

In particular, the jobs "monitor and label incomplete jobs - dry run label" (1) and "monitor and label incomplete jobs - osd" (2) fail. I assume (1) is about o3 and (2) is about OSD.

Hm, if it had been only about osd I would have suspected the issue to be related to #73633, but as o3 also seems to be involved I wonder whether the client being slow to react is linked to an update of dependencies that we might have received.

Actually no, "monitor and label incomplete jobs - dry run label" is also about osd but just "dry run", i.e. there should not be any writing call over the openQA API. In both the pipeline from yesterday and today I found that both the "dry run" and actual run for osd failed but passed for o3. So it is strange that the dry run already fails but as this is OSD specific this might still be about #73633

Still, it is not as if all openqa-client calls failed, e.g. https://gitlab.suse.de/openqa/auto-review/-/jobs/277707#L132 shows

++ openqa-client --json-output --host https://openqa.suse.de jobs/4899635

which is an actual call to a real-life job on osd and that seems to work fine.

Comparing the "last good" dry run log from https://gitlab.suse.de/openqa/auto-review/-/jobs/277063 with the "first bad" one from https://gitlab.suse.de/openqa/auto-review/-/jobs/277707, in the "last good" case I see for example:

+ id=4892526
+ search_term='(?s)([dD]ownload.*failed.*404).*Result: setup failure'
+ comment='label:non_existing asset, candidate for removal or wrong settings'
+ restart=
+ grep_opts=-qPzo
+ check=
+ [[ -n '' ]]
+ grep -qPzo '(?s)([dD]ownload.*failed.*404).*Result: setup failure' /tmp/tmp.4UvfFy39X0
+ echo openqa-client --host https://openqa.suse.de jobs/4892526/comments post 'text=label:non_existing asset, candidate for removal or wrong settings'
+ '[' '' = 1 ']'
+ return
+ return
+ for i in $(cat - | sed 's/ .*$//')
+ investigate_issue https://openqa.suse.de/tests/4889562
+ local id=4889562
+ local out
+ local reason
+ local curl
++ openqa-client --json-output --host https://openqa.suse.de jobs/4889562
++ jq -r .job.reason
+ reason='backend died: qemu-img: Could not open '\''sle-15-SP3-aarch64-68.2@aarch64-virtio-minimal_with_sdk68.2_installed-uefi-vars.qcow2'\'': Could not open '\''sle-15-SP3-aarch64-68.2@aarch64-virtio-minimal_with_sdk68.2_installed-uefi-vars.qcow2'\'': No such file or directory'
+ '[' -n 'backend died: qemu-img: Could not open '\''sle-15-SP3-aarch64-68.2@aarch64-virtio-minimal_with_sdk68.2_installed-uefi-vars.qcow2'\'': Could not open '\''sle-15-SP3-aarch64-68.2@aarch64-virtio-minimal_with_sdk68.2_installed-uefi-vars.qcow2'\'': No such file or directory' ']'
+ '[' 257 -lt 8 ']'

and in the "first bad" the corresponding log excerpt is:

+ id=4906221
+ search_term='(?s)([dD]ownload.*failed.*404).*Result: setup failure'
+ comment='label:non_existing asset, candidate for removal or wrong settings'
+ restart=
+ grep_opts=-qPzo
+ check=
+ [[ -n '' ]]
+ grep -qPzo '(?s)([dD]ownload.*failed.*404).*Result: setup failure' /tmp/tmp.4EuvKRwfVh
...............

So after the grep there should be a line echo openqa-client --host https://openqa.s… but there is nothing.
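
For context, the trace corresponds roughly to a step like the following. This is a simplified sketch, not the actual openqa-label-known-issues code; label_job, client_call and the log path are made-up names, while the id, pattern and comment are taken from the trace above:

client_call="echo openqa-client"   # dry run: only print the client call instead of issuing it
label_job() {
    local id=$1 search_term=$2 comment=$3
    local log="/tmp/autoinst-log-$id.txt"   # job log fetched earlier (hypothetical path)
    # only act when the known-issue pattern matches the job log
    grep -qPzo "$search_term" "$log" || return
    $client_call --host https://openqa.suse.de "jobs/$id/comments" post "text=$comment"
}

label_job 4906221 '(?s)([dD]ownload.*failed.*404).*Result: setup failure' \
    'label:non_existing asset, candidate for removal or wrong settings'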

Actions #2

Updated by mkittler about 4 years ago

  • Priority changed from Urgent to High

It seems to work again, so maybe it really is just caused by OSD being unresponsive. It would make sense to give the client a better timeout then, e.g. along the lines sketched below.
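
Sketch of one possible mitigation, not an existing option of openqa-client or of the script: bound each client call so a hanging request fails fast instead of stalling the CI job until the GitLab job timeout. The limits are arbitrary and the job id is only an example from the trace above:

host=https://openqa.suse.de
id=4889562   # example job from the trace above
client() {
    timeout --kill-after=10 120 openqa-client --host "$host" "$@"
}
reason=$(client --json-output "jobs/$id" | jq -r .job.reason)
echo "$reason"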


Actually no, "monitor and label incomplete jobs - dry run label" is also about osd but just "dry run"

I assumed it was about o3 because there are no environment variables set in "monitor and label incomplete jobs - dry run label" and the default host is o3. But apparently the default is overridden by TARGET in .gitlab-ci.yml (wherever TARGET comes from). Maybe the job names should be clearer because this kind of reverse engineering is error-prone and annoying.
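
To spell out the defaulting that caused the confusion, something along these lines; this is a hypothetical illustration, the actual wiring in the script and in .gitlab-ci.yml may differ:

host=${TARGET:-https://openqa.opensuse.org}   # o3 is the fallback, TARGET overrides it
openqa-client --json-output --host "$host" jobs/4899635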

Actions #3

Updated by okurz about 4 years ago

  • Related to action #77101: fix selection of gitlab CI runners added
Actions #4

Updated by okurz about 4 years ago

  • Status changed from Workable to Resolved
  • Assignee set to okurz

I assume this was somewhat related to #77101 and to changes within the gitlab CI runner infrastructure by EngInfra. I have not observed the problem for some days now; auto-review seems to be ok regarding runtimes.

Actions #5

Updated by okurz 3 months ago

  • Parent task set to #166655