action #178492
openQA Project (public) - coordination #154777: [saga][epic] Shareable os-autoinst and test distribution plugins (closed)
openQA Project (public) - coordination #162131: [epic] future version control related features in openQA (open)
[alert] Many failing `git_clone` Minion jobs auto_review:"Error detecting remote default branch name":retry size:S
Description
Observation
We got many failing `git_clone` Minion jobs (failing for at least several hours) on OSD. I briefly checked some of them and they failed with the following common error message:
result: "Error detecting remote default branch name for \"gitlab@gitlab.suse.de:openqa/os-autoinst-needles-sles.git\":
\ remote: \nremote: ========================================================================\nremote:
\nremote: Internal API unreachable\nremote: \nremote: ========================================================================\nremote:
\nfatal: Could not read from remote repository.\n\nPlease make sure you have the
correct access rights\nand the repository exists. at /usr/share/openqa/script/../lib/OpenQA/Git.pm
line 147.\n"
See https://openqa.suse.de/minion/jobs?task=git_clone&state=failed, https://monitor.qa.suse.de/d/WebuiDb/webui-summary?viewPanel=panel-19&orgId=1 and https://monitor.qa.suse.de/alerting/grafana/liA25iB4k/view?orgId=1 for details.
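For reference, the failing step can be reproduced by hand. A minimal check, assuming OpenQA::Git detects the default branch via `git ls-remote` (the exact invocation in Git.pm may differ):

    # ask the remote which branch its HEAD points to; during the outage this
    # failed with the "Internal API unreachable" message quoted above
    git ls-remote --symref gitlab@gitlab.suse.de:openqa/os-autoinst-needles-sles.git HEAD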
@mkittler cleaned up the Minion dashboard so the link above doesn't show the failures anymore. The oldest jobs were from 17 days ago.
This ticket is for handling this specific outage; the long-term solution of making openQA resilient to transient outages is covered in the feature request #179038.
Suggestions
- Investigate whether there were incomplete jobs due to this.
- Consider adding a regex to the ticket subject.
- Damage is likely limited: if we can't sync needles, nobody can edit needles.
- Ensure https://openqa.suse.de/minion/jobs?state=failed&offset=0&task=git_clone is empty, e.g. by retrying the failed jobs once GitLab is reachable again (see the sketch after this list).
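One way to empty the failed list is the Minion command line that comes with openQA's Mojolicious stack. A rough sketch, assuming the usual openQA deployment path and service user on OSD:

    # list failed git_clone Minion jobs
    sudo -u geekotest /usr/share/openqa/script/openqa minion job -t git_clone -S failed
    # retry a single job by its ID once GitLab is reachable again
    # (repeat per job, or script a loop over the listed IDs)
    sudo -u geekotest /usr/share/openqa/script/openqa minion job --retry <job_id>

Newer Minion versions also offer a bulk `--retry-failed` option, which may be more convenient if many jobs are affected.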
Rollback actions
- Remove the silence for the `alertname=web UI: Too many Minion job failures` alert from https://monitor.qa.suse.de/alerting/silences?alertmanager=grafana (see the sketch below for an API-based alternative)
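The silence can be removed in the Grafana UI linked above; alternatively, a sketch via Grafana's Alertmanager-compatible API, where the silence ID and a token with alerting permissions are assumptions:

    # list silences to find the ID of the one to remove
    curl -s -H "Authorization: Bearer $GRAFANA_TOKEN" \
      https://monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silences
    # delete the silence by its ID
    curl -s -X DELETE -H "Authorization: Bearer $GRAFANA_TOKEN" \
      https://monitor.qa.suse.de/api/alertmanager/grafana/api/v2/silence/<silence_id>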
Updated by robert.richardson 3 months ago
- Status changed from Workable to In Progress
- Assignee set to robert.richardson
Updated by openqa_review 3 months ago
- Due date set to 2025-03-26
Setting due date based on mean cycle time of SUSE QE Tools
Updated by robert.richardson 3 months ago
- Subject changed from [alert] Many failing `git_clone` Minion jobs size:S to [alert] Many failing `git_clone` Minion jobs auto_review:"Error detecting remote default branch name":retry size:S
Updated by robert.richardson 3 months ago · Edited
@okurz My attempt at a user story for this in a feature request context, as you asked for earlier, would be something like:
As a test engineer and openQA operator,
I want openQA to handle short-lived GitLab outages without causing mass Minion job failures,
so that users do not experience unnecessary disruption, while prolonged outages are still detected and reported effectively.
Should I really create a separate feature ticket for this as you mentioned, though? Wouldn't the AC and suggestions be more or less the same as for this ticket?
Updated by okurz 3 months ago
robert.richardson wrote in #note-7:
@okurz My attempt at a user story for this in a feature request context, as you asked for earlier, would be something like:
As a test engineer and openQA operator, I want openQA to handle short-lived GitLab outages without causing mass Minion job failures, so that users do not experience unnecessary disruption, while prolonged outages are still detected and reported effectively.
Should I really create a separate feature ticket for this as you mentioned, though? Wouldn't the AC and suggestions be more or less the same as for this ticket?
Yes, you can take over the AC and suggestions from this ticket for the new openQA feature request and revert this ticket back to just what is relevant for the "infra" part. I set the priority to "High" because the issue was, and still should be, an alert reaction ticket. Right now we are trying to pack a feature request into it, and that should be handled explicitly and separately.
Updated by robert.richardson 3 months ago
- Related to action #179038: Gracious handling of longer remote git clones outages size:S added
Updated by livdywan 3 months ago
Looking at `interval=60 ./openqa-query-for-job-label "poo#178492"` is not revealing any jobs. The default timeframe is 30 days. Maybe none of the jobs were labelled with the ticket?
We don't have any known affected jobs here, so it's unclear whether we are missing any. The Minion jobs contain no openQA job IDs.
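For context, that script essentially searches openQA job comments for the label text. A hand-rolled equivalent, sketched under the assumption of direct database access on the web UI host (table and column names as in openQA's schema):

    # roughly what openqa-query-for-job-label does, run on the openQA host
    sudo -u geekotest psql openqa -c "
      SELECT jobs.id, jobs.result, comments.t_created
        FROM comments JOIN jobs ON comments.job_id = jobs.id
       WHERE comments.text LIKE '%poo#178492%'
         AND comments.t_created >= now() - interval '30 days';"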
Updated by robert.richardson 3 months ago
Thanks again for the help. I also followed the scripts repo's README and ran `./openqa-review-failed` beforehand, hoping it would label any relevant jobs, but as @livdywan mentioned, `./openqa-query-for-job-label "poo#178492"` does not output any jobs at all. As we can't link the corresponding tests, the Minion jobs have been cleaned up, and the failures happened almost a month ago, I'm resolving this ticket as discussed in the daily.
Note: I've created the related feature request for a proper outage handling mechanism in #179038.
Updated by robert.richardson 3 months ago
- Status changed from In Progress to Resolved
Updated by robert.richardson about 2 months ago · Edited
I removed the silence. As the PR regarding git server outage handling has been merged, I think we can resolve this as well.
Updated by robert.richardson about 2 months ago
- Status changed from Workable to Resolved
Updated by tinita about 2 months ago
- Related to action #180962: Many minion failures related to obs_rsync_run due to SLFO submissions size:S added