action #178492

closed

openQA Project (public) - coordination #154777: [saga][epic] Shareable os-autoinst and test distribution plugins

openQA Project (public) - coordination #162131: [epic] future version control related features in openQA

[alert] Many failing `git_clone` Minion jobs auto_review:"Error detecting remote default branch name":retry size:S

Added by mkittler about 2 months ago. Updated 8 days ago.

Status: Resolved
Priority: Normal
Category: Regressions/Crashes
Start date: 2025-03-07
Due date:
% Done: 0%
Estimated time:
Description

Observation

We got many failing git_clone Minion jobs (for at least several hours) on OSD. I briefly checked some of them and they failed with the following common error message:

result: "Error detecting remote default branch name for \"gitlab@gitlab.suse.de:openqa/os-autoinst-needles-sles.git\":
  \ remote: \nremote: ========================================================================\nremote:
  \nremote: Internal API unreachable\nremote: \nremote: ========================================================================\nremote:
  \nfatal: Could not read from remote repository.\n\nPlease make sure you have the
  correct access rights\nand the repository exists. at /usr/share/openqa/script/../lib/OpenQA/Git.pm
  line 147.\n"
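
The failing step is the detection of the remote default branch before cloning or fetching. As a minimal sketch (not the actual OpenQA::Git code, just an equivalent check for reproducing the error locally, using the repository URL from the error message and assuming working SSH access to GitLab):

  # Ask the remote which branch HEAD points to and strip the refs/heads/ prefix
  git ls-remote --symref gitlab@gitlab.suse.de:openqa/os-autoinst-needles-sles.git HEAD \
    | awk '/^ref:/ { sub("refs/heads/", "", $2); print $2 }'

During the outage such a query would presumably have failed with the same "Internal API unreachable" message from GitLab that the Minion jobs reported.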

See https://openqa.suse.de/minion/jobs?task=git_clone&state=failed and https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=panel-19&orgId=1 and https://monitor.qa.suse.de/alerting/grafana/liA25iB4k/view?orgId=1 for details.

@mkittler cleaned up the Minion dashboard so the link above doesn't show the failures anymore. The oldest jobs were from 17 days ago.

This ticket is for handling this specific outage; the long-term solution to make openQA resilient to transient outages is covered in Feature Request #179038.
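
For illustration, the kind of graceful handling #179038 aims for could look roughly like this (a hypothetical sketch, not the merged implementation; the retry count and delays are made up):

  # Retry the remote check a few times with increasing delay before giving up,
  # so that short GitLab hiccups do not immediately fail the Minion job.
  repo_url=gitlab@gitlab.suse.de:openqa/os-autoinst-needles-sles.git
  for attempt in 1 2 3; do
    git ls-remote --symref "$repo_url" HEAD >/dev/null && break
    [ "$attempt" -eq 3 ] && { echo "remote still unreachable, failing" >&2; exit 1; }
    sleep $((attempt * 30))
  done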

Suggestions

Rollback actions


Related issues 2 (1 open, 1 closed)

Related to openQA Project (public) - action #179038: Gracious handling of longer remote git clones outages size:S (Resolved, robert.richardson, 2025-03-17 to 2025-04-11)

Related to openQA Infrastructure (public) - action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S (In Progress, livdywan, 2025-04-14)

Actions #1

Updated by mkittler about 2 months ago

  • Description updated (diff)
Actions #2

Updated by okurz about 2 months ago

  • Tags changed from infra, alert to infra, alert, reactive work, osd
  • Category set to Regressions/Crashes
  • Priority changed from Normal to High
  • Target version set to Ready
Actions #3

Updated by livdywan about 1 month ago

  • Subject changed from [alert] Many failing `git_clone` Minion jobs to [alert] Many failing `git_clone` Minion jobs size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by robert.richardson about 1 month ago

  • Status changed from Workable to In Progress
  • Assignee set to robert.richardson
Actions #5

Updated by openqa_review about 1 month ago

  • Due date set to 2025-03-26

Setting due date based on mean cycle time of SUSE QE Tools

Actions #6

Updated by robert.richardson about 1 month ago

  • Subject changed from [alert] Many failing `git_clone` Minion jobs size:S to [alert] Many failing `git_clone` Minion jobs auto_review:"Error detecting remote default branch name":retry size:S
Actions #7

Updated by robert.richardson about 1 month ago · Edited

@okurz My attempt at a user story for this in a feature request context, as you asked earlier, would be something like:

As a test engineer and openQA operator,
I want openQA to handle short-lived GitLab outages without causing mass Minion job failures,
so that users do not experience unnecessary disruption, while prolonged outages are still detected and reported effectively.

Should I really create a separate feature ticket for this as you mentioned, though? Wouldn't the AC and suggestions be more or less the same as for this ticket?

Actions #8

Updated by okurz about 1 month ago

robert.richardson wrote in #note-7:

@okurz My attempt at a user story for this in a feature request context, as you asked earlier, would be something like:

As a test engineer and openQA operator,
I want openQA to handle short-lived GitLab outages without causing mass Minion job failures,
so that users do not experience unnecessary disruption, while prolonged outages are still detected and reported effectively.

Should I really create a separate feature ticket for this as you mentioned, though? Wouldn't the AC and suggestions be more or less the same as for this ticket?

Yes, you can take over the AC and suggestions from this ticket for the new openQA feature request and revert this ticket back to just what is relevant for the "infra" part. I set the priority to "High" because at the time this was, and still should be, an alert reaction ticket. Now we are trying to pack a feature request into it, and that should explicitly be handled separately.

Actions #9

Updated by robert.richardson about 1 month ago

  • Related to action #179038: Gracious handling of longer remote git clones outages size:S added
Actions #10

Updated by robert.richardson about 1 month ago

  • Description updated (diff)
Actions #11

Updated by robert.richardson about 1 month ago

  • Description updated (diff)
Actions #12

Updated by livdywan about 1 month ago

Looking at `interval=60 ./openqa-query-for-job-label "poo#178492"` is not revealing any jobs. The default timeframe is 30 days. Maybe none of the jobs were labelled with the ticket?

We don't have any known affected job here, so it's unclear whether we are missing jobs. The Minion jobs have no job IDs.

Actions #13

Updated by robert.richardson about 1 month ago

Thanks again for the help. I also followed the scripts repo's README and ran ./openqa-review-failed beforehand, hoping it would label any relevant jobs, but as @livdywan mentioned, ./openqa-query-for-job-label "poo#178492" does not output any jobs at all. Since we can't link the corresponding tests, the Minion jobs have been cleaned up, and the failures happened almost a month ago, I'm resolving this ticket as discussed in the daily.

Note: I've created the related feature request for a proper outage handling mechanism in #179038

Actions #14

Updated by robert.richardson about 1 month ago

  • Status changed from In Progress to Resolved
Actions #15

Updated by okurz about 1 month ago

  • Due date deleted (2025-03-26)
  • Parent task set to #162131
Actions #16

Updated by okurz about 1 month ago

  • Description updated (diff)
  • Status changed from Resolved to Blocked
  • Priority changed from High to Normal

Issue happened again. Added a silence and corresponding rollback actions. Now we can wait for #179038.

Actions #17

Updated by livdywan 8 days ago

  • Status changed from Blocked to Workable

Blocker resolved

Actions #18

Updated by robert.richardson 8 days ago · Edited

I removed the silence. As the PR regarding git server outage handling is merged, I think we can resolve this as well.

Actions #19

Updated by robert.richardson 8 days ago

  • Status changed from Workable to Resolved
Actions #20

Updated by tinita 8 days ago

  • Related to action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S added