action #178492
openQA Project (public) - coordination #154777 (closed): [saga][epic] Shareable os-autoinst and test distribution plugins
openQA Project (public) - coordination #162131: [epic] future version control related features in openQA
[alert] Many failing `git_clone` Minion jobs auto_review:"Error detecting remote default branch name":retry size:S
Description
Observation
We got many failing `git_clone` Minion jobs (for at least several hours) on OSD. I briefly checked some of them and they failed with the following common error message:
result: "Error detecting remote default branch name for \"gitlab@gitlab.suse.de:openqa/os-autoinst-needles-sles.git\":
\ remote: \nremote: ========================================================================\nremote:
\nremote: Internal API unreachable\nremote: \nremote: ========================================================================\nremote:
\nfatal: Could not read from remote repository.\n\nPlease make sure you have the
correct access rights\nand the repository exists. at /usr/share/openqa/script/../lib/OpenQA/Git.pm
line 147.\n"
See https://openqa.suse.de/minion/jobs?task=git_clone&state=failed and https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=panel-19&orgId=1 and https://monitor.qa.suse.de/alerting/grafana/liA25iB4k/view?orgId=1 for details.
@mkittler cleaned up the Minion dashboard so the link above doesn't show the failures anymore. The oldest jobs were from 17 days ago.
This ticket is for handling this specific outage. The long-term solution to make openQA resilient to transient outages is covered in feature request #179038.
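To illustrate what resilience to a transient outage could look like, here is a minimal Python sketch of retrying with exponential backoff. It is not the openQA implementation (openQA is written in Perl, and the change made for #179038 is not shown in this ticket). The names `retry_transient`, `GitOutageError`, `TRANSIENT_MARKERS`, the retry counts and delays, and the use of `git ls-remote --symref` to find the default branch are all assumptions made for illustration; only the "Internal API unreachable" text is taken from the error message above.

```python
import re
import subprocess
import time

# Assumption: substrings that mark a failure as transient. The only entry
# is taken from the error message quoted in this ticket.
TRANSIENT_MARKERS = ("Internal API unreachable",)


class GitOutageError(RuntimeError):
    """Hypothetical error: the remote stayed unreachable for all attempts."""


def is_transient(message):
    """Return True if the error text looks like a temporary server problem."""
    return any(marker in message for marker in TRANSIENT_MARKERS)


def parse_default_branch(ls_remote_output):
    """Extract the branch from `git ls-remote --symref <url> HEAD` output."""
    match = re.search(r"^ref: refs/heads/(\S+)\s+HEAD$", ls_remote_output, re.MULTILINE)
    if match is None:
        raise RuntimeError("Error detecting remote default branch name")
    return match.group(1)


def detect_default_branch(url):
    """Ask the remote which branch HEAD points to (assumed approach)."""
    result = subprocess.run(
        ["git", "ls-remote", "--symref", url, "HEAD"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return parse_default_branch(result.stdout)


def retry_transient(action, attempts=5, base_delay=30.0, sleep=time.sleep):
    """Call action(); retry transient failures with exponential backoff.

    Non-transient errors are re-raised immediately. If every attempt fails
    with a transient error, GitOutageError is raised so that a prolonged
    outage is still reported.
    """
    for attempt in range(attempts):
        try:
            return action()
        except RuntimeError as err:
            if not is_transient(str(err)):
                raise
            if attempt == attempts - 1:
                raise GitOutageError(f"remote unreachable after {attempts} attempts") from err
            sleep(base_delay * 2 ** attempt)
```

A caller would wrap the network operation, for example `retry_transient(lambda: detect_default_branch(url))`. Short outages are then absorbed by the waiting periods, while an outage that outlasts all attempts still surfaces as a single, clearly named failure.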
Suggestions
- Investigate if there were incomplete jobs due to this.
- Consider adding a regex to the ticket subject
- Damage is likely limited. If we can't sync needles nobody can edit needles.
- Ensure https://openqa.suse.de/minion/jobs?state=failed&offset=0&task=git_clone is empty
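The regex suggestion refers to the `auto_review:"<regex>":retry` marker that this ticket's subject carries. As a rough illustration of how such a marker could be read and matched against a failure message, here is a small Python sketch. The function `parse_auto_review` and the pattern `AUTO_REVIEW` are hypothetical, and the real tooling's parsing and matching rules are not shown in this ticket, so treat this as a simplified assumption rather than a description of its behaviour.

```python
import re

# Assumption: the marker has the form auto_review:"<regex>" with an
# optional ":retry" suffix, as seen in this ticket's subject.
AUTO_REVIEW = re.compile(r'auto_review:"(?P<regex>[^"]+)"(?P<retry>:retry)?')


def parse_auto_review(subject):
    """Return (compiled_regex, retry_flag), or None if there is no marker."""
    match = AUTO_REVIEW.search(subject)
    if match is None:
        return None
    return re.compile(match.group("regex")), match.group("retry") is not None


subject = (
    "[alert] Many failing `git_clone` Minion jobs "
    'auto_review:"Error detecting remote default branch name":retry size:S'
)
failure = (
    "Error detecting remote default branch name for "
    '"gitlab@gitlab.suse.de:openqa/os-autoinst-needles-sles.git"'
)

pattern, retry = parse_auto_review(subject)
# True when the failure text matches the regex embedded in the subject.
matches_ticket = pattern.search(failure) is not None
```

Keeping the regex short and anchored on the stable part of the message ("Error detecting remote default branch name") avoids depending on the repository URL, which differs between jobs.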
Rollback actions
- Remove silence `alertname=web UI: Too many Minion job failures alert` from https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana
Updated by okurz about 2 months ago
- Tags changed from infra, alert to infra, alert, reactive work, osd
- Category set to Regressions/Crashes
- Priority changed from Normal to High
- Target version set to Ready
Updated by livdywan about 1 month ago
- Subject changed from [alert] Many failing `git_clone` Minion jobs to [alert] Many failing `git_clone` Minion jobs size:S
- Description updated (diff)
- Status changed from New to Workable
Updated by robert.richardson about 1 month ago
- Status changed from Workable to In Progress
- Assignee set to robert.richardson
Updated by openqa_review about 1 month ago
- Due date set to 2025-03-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by robert.richardson about 1 month ago
- Subject changed from [alert] Many failing `git_clone` Minion jobs size:S to [alert] Many failing `git_clone` Minion jobs auto_review:"Error detecting remote default branch name":retry size:S
Updated by robert.richardson about 1 month ago · Edited
@okurz My attempt at a user story for this in a feature request context, as you asked for earlier, would be something like:
As a test engineer and openQA operator,
I want openQA to handle short-lived GitLab outages without causing mass Minion job failures,
so that users do not experience unnecessary disruption, while prolonged outages are still detected and reported effectively.
Should I really create a separate feature ticket for this as you mentioned, though? Wouldn't the AC and suggestions be more or less the same as for this ticket?
Updated by okurz about 1 month ago
robert.richardson wrote in #note-7:
@okurz My attempt at a user story for this in a feature request context, as you asked for earlier, would be something like:
As a test engineer and openQA operator, I want openQA to handle short-lived GitLab outages without causing mass Minion job failures, so that users do not experience unnecessary disruption, while prolonged outages are still detected and reported effectively.
Should I really create a separate feature ticket for this as you mentioned, though? Wouldn't the AC and suggestions be more or less the same as for this ticket?
Yes, you can take over the AC and suggestions from this ticket for the new openQA feature request and revert this ticket back to just what is relevant for the "infra" part. I set the priority to "High" because at the time the issue was, and still should be, an alert reaction ticket. Now we are trying to pack a feature request into it, and that should be handled explicitly and separately.
Updated by robert.richardson about 1 month ago
- Related to action #179038: Gracious handling of longer remote git clones outages size:S added
Updated by livdywan about 1 month ago
Looking at `interval=60 ./openqa-query-for-job-label "poo#178492"` is not revealing any jobs. The default timeframe is 30 days. Maybe none of the jobs were labelled with the ticket?
We don't have any known affected job here, so it's unclear if we are missing jobs. The Minion jobs have no job IDs.
Updated by robert.richardson about 1 month ago
Thanks again for the help. I also followed the scripts repo's README and ran `./openqa-review-failed` beforehand, hoping it would label any relevant jobs, but as @livdywan mentioned, `./openqa-query-for-job-label "poo#178492"` does not output any jobs at all. As we can't link the corresponding tests, the Minion jobs have been cleaned up and the failures happened almost a month ago, I'm resolving this ticket as discussed in the daily.
Note: I've created the related feature request for a proper outage handling mechanism in #179038.
Updated by robert.richardson about 1 month ago
- Status changed from In Progress to Resolved
Updated by okurz about 1 month ago
- Due date deleted (2025-03-26)
- Parent task set to #162131
Updated by robert.richardson 8 days ago · Edited
I removed the silence. As the PR regarding git server outage handling is merged, I think we can resolve this as well.
Updated by robert.richardson 8 days ago
- Status changed from Workable to Resolved
Updated by tinita 8 days ago
- Related to action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S added