action #178492
openQA Project (public) - coordination #154777: [saga][epic] Shareable os-autoinst and test distribution plugins (closed)
openQA Project (public) - coordination #162131: [epic] future version control related features in openQA (open)
[alert] Many failing `git_clone` Minion jobs auto_review:"Error detecting remote default branch name":retry size:S
Description
Observation
We got many failing `git_clone` Minion jobs (failing for at least several hours) on OSD. I briefly checked some of them and they failed with the following common error message:
result: "Error detecting remote default branch name for \"gitlab@gitlab.suse.de:openqa/os-autoinst-needles-sles.git\":
\ remote: \nremote: ========================================================================\nremote:
\nremote: Internal API unreachable\nremote: \nremote: ========================================================================\nremote:
\nfatal: Could not read from remote repository.\n\nPlease make sure you have the
correct access rights\nand the repository exists. at /usr/share/openqa/script/../lib/OpenQA/Git.pm
line 147.\n"
See https://openqa.suse.de/minion/jobs?task=git_clone&state=failed, https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=panel-19&orgId=1 and https://monitor.qa.suse.de/alerting/grafana/liA25iB4k/view?orgId=1 for details.
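For reference, the failing step can be reproduced by hand. A minimal check, assuming OpenQA::Git detects the default branch via `git ls-remote` (the exact invocation in Git.pm may differ):

    # ask the remote which branch its HEAD points to; during the outage this
    # failed with the "Internal API unreachable" message quoted above
    git ls-remote --symref gitlab@gitlab.suse.de:openqa/os-autoinst-needles-sles.git HEAD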
@mkittler cleaned up the Minion dashboard so the link above doesn't show the failures anymore. The oldest jobs were from 17 days ago.
This ticket is for handling this specific outage; the long-term solution of making openQA resilient to transient outages is covered in the feature request #179038.
Suggestions
- Investigate whether there were incomplete jobs due to this.
- Consider adding a regex to the ticket subject.
- Damage is likely limited: if we can't sync needles, nobody can edit needles.
- Ensure https://openqa.suse.de/minion/jobs?state=failed&offset=0&task=git_clone is empty, e.g. by retrying the failed jobs once GitLab is reachable again (see the sketch after this list).
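One way to empty the failed list is the Minion command line that comes with openQA's Mojolicious stack. A rough sketch, assuming the usual openQA deployment path and service user on OSD:

    # list failed git_clone Minion jobs
    sudo -u geekotest /usr/share/openqa/script/openqa minion job -t git_clone -S failed
    # retry a single job by its ID once GitLab is reachable again
    # (repeat per job, or script a loop over the listed IDs)
    sudo -u geekotest /usr/share/openqa/script/openqa minion job --retry <job_id>

Newer Minion versions also offer a bulk `--retry-failed` option, which may be more convenient if many jobs are affected.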
Rollback actions
- Remove the silence for the `alertname=web UI: Too many Minion job failures` alert from https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana (see the sketch below for an API-based alternative)
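The silence can be removed in the Grafana UI linked above; alternatively, a sketch via Grafana's Alertmanager-compatible API, where the silence ID and a token with alerting permissions are assumptions:

    # list silences to find the ID of the one to remove
    curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
      https://monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silences
    # delete the silence by its ID
    curl -s -X DELETE -H "Authorization: Bearer $GRAFANA_TOKEN" \
      https://monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silence/<silence_id>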
Updated by robert.richardson 3 months ago
- Status changed from Workable to In Progress
- Assignee set to robert.richardson
Updated by openqa_review 3 months ago
- Due date set to 2025-03-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by robert.richardson 3 months ago
- Subject changed from [alert] Many failing `git_clone` Minion jobs size:S to [alert] Many failing `git_clone` Minion jobs auto_review:"Error detecting remote default branch name":retry size:S
Updated by robert.richardson 3 months ago · Edited
@okurz My attempt at a user story for this in a feature request context, as you asked for earlier, would be something like:
As a test engineer and openQA operator,
I want openQA to handle short-lived GitLab outages without causing mass Minion job failures,
so that users do not experience unnecessary disruption, while prolonged outages are still detected and reported effectively.
Should I really create a separate feature ticket for this as you mentioned, though? Wouldn't the AC and suggestions be more or less the same as for this ticket?
Updated by okurz 3 months ago
robert.richardson wrote in #note-7:
@okurz My attempt at a user story for this in a feature request context, as you asked for earlier, would be something like:
As a test engineer and openQA operator, I want openQA to handle short-lived GitLab outages without causing mass Minion job failures, so that users do not experience unnecessary disruption, while prolonged outages are still detected and reported effectively.
Should I really create a separate feature ticket for this as you mentioned, though? Wouldn't the AC and suggestions be more or less the same as for this ticket?
Yes, you can take over the AC and suggestions from this ticket for the new openQA feature request and revert this ticket back to just what is relevant for the "infra" part. I set the priority to "High" because the issue was, and still should be, an alert reaction ticket. Right now we are trying to pack a feature request into it, and that should be handled explicitly and separately.
Updated by robert.richardson 3 months ago
- Related to action #179038: Gracious handling of longer remote git clones outages size:S added
Updated by livdywan 3 months ago
Looking at `interval=60 ./openqa-query-for-job-label "poo#178492"` is not revealing any jobs. The default timeframe is 30 days. Maybe none of the jobs were labelled with the ticket?
We don't have any known affected jobs here, so it's unclear whether we are missing any. The Minion jobs contain no openQA job IDs.
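For context, that script essentially searches openQA job comments for the label text. A hand-rolled equivalent, sketched under the assumption of direct database access on the web UI host (table and column names as in openQA's schema):

    # roughly what openqa-query-for-job-label does, run on the openQA host
    sudo -u geekotest psql openqa -c "
      SELECT jobs.id, jobs.result, comments.t_created
        FROM comments JOIN jobs ON comments.job_id = jobs.id
       WHERE comments.text LIKE '%poo#178492%'
         AND comments.t_created >= now() - interval '30 days';"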
Updated by robert.richardson 3 months ago
Thanks again for the help. I also followed the scripts repo's README and ran `./openqa-review-failed` beforehand, hoping it would label any relevant jobs, but as @livdywan mentioned, `./openqa-query-for-job-label "poo#178492"` does not output any jobs at all. As we can't link the corresponding tests, the Minion jobs have been cleaned up, and the failures happened almost a month ago, I'm resolving this ticket as discussed in the daily.
Note: I've created the related feature request for a proper outage handling mechanism in #179038.
Updated by robert.richardson 3 months ago
- Status changed from In Progress to Resolved
Updated by robert.richardson about 2 months ago · Edited
I removed the silence. As the PR regarding git server outage handling has been merged, I think we can resolve this as well.
Updated by robert.richardson about 2 months ago
- Status changed from Workable to Resolved
Updated by tinita about 2 months ago
- Related to action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S added