action #179038
closed
coordination #154777: [saga][epic] Shareable os-autoinst and test distribution plugins
coordination #162131: [epic] future version control related features in openQA
Gracious handling of longer remote git clones outages size:S
Added by robert.richardson 2 months ago.
Updated 13 days ago.
Category: Feature requests
Description
Motivation
Currently, git_clone minion jobs fail when GitLab is temporarily unreachable (see #178492). By introducing a proper error-handling mechanism, we can ensure:
- Temporary outages do not cause unnecessary job failures or alerts.
User Story
As a test engineer and openQA operator,
I want openQA to handle short-lived GitLab outages without causing mass Minion job failures,
so that users do not experience unnecessary disruption.
Acceptance Criteria
- AC1: Temporary remote git outages don't cause failing minion jobs
- AC2: An update of remote git repositories is still ensured when requests fail only briefly, e.g. for a few seconds
Suggestions
- Damage is likely limited. If we can't sync needles, nobody can edit needles.
- Jobs end up incomplete if there's an ongoing issue with git_clone minion jobs
- We could decide to eventually give up, continue anyway and let jobs run
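The "retry for a while, then give up and continue" idea from the suggestions could be sketched roughly as follows. This is only an illustration in Python, not the openQA implementation (openQA itself is written in Perl and uses Minion); the function name `run_with_retries`, its parameters and the default values are hypothetical.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_retries(
    cmd: Sequence[str],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run cmd, retrying with exponential backoff.

    Returns True as soon as one attempt exits with status 0 and False
    once all attempts have failed, so the caller can decide to give up
    and continue anyway instead of failing hard.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        if attempt < max_attempts:
            # Wait 1x, 2x, 4x, ... base_delay between attempts (hypothetical policy).
            sleep(base_delay * 2 ** (attempt - 1))
    return False
```

A caller could wrap a remote check such as `git ls-remote --symref <url> HEAD` with this helper and, on a `False` result, log a warning and let the openQA job run against the existing checkout rather than marking the whole operation as failed. Whether that is acceptable depends on how stale the checkout may be.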
- Description updated (diff)
- Related to action #178492: [alert] Many failing `git_clone` Minion jobs auto_review:"Error detecting remote default branch name":retry size:S added
- Description updated (diff)
- Target version changed from Ready to Tools - Next
- Subject changed from Improve GitLab Outage Handling in openQA to Gracious handling of longer remote git clones outages size:S
- Description updated (diff)
- Parent task set to #162131
- Copied to action #179185: Detection of long-time remote git clone outages size:S added
- Status changed from New to Workable
- Target version changed from Tools - Next to Ready
- Assignee set to robert.richardson
- Status changed from Workable to In Progress
- Due date set to 2025-04-08
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Workable
- Due date deleted (2025-04-08)
- Due date set to 2025-04-11
- Status changed from Workable to In Progress
- Status changed from In Progress to Feedback
- Copied to action #180863: Conduct lessons learned "Five Why" analysis for "Gracious handling of longer remote git clones outages" size:S added
- Status changed from Feedback to Resolved
- Due date deleted (2025-04-11)
- Status changed from Resolved to Workable
- Priority changed from Normal to High
- Assignee deleted (robert.richardson)
This also affected users, as can be seen in https://suse.slack.com/archives/C02CANHLANP/p1746528651080959. I manually fixed it by executing `chown geekotest:root /var/lib/openqa/` to allow the process to write into the folder itself. According to user reports, this was enough to make it work. Either we change the path or we have to ensure `/var/lib/openqa/` has the correct permissions and/or owner.
- Related to action #182021: [alert] web UI: Too many Minion job failures alert added
In order to avoid an alert, I deleted the failing git_clone jobs except for the last one, which is linked here.
- Status changed from Workable to In Progress
- Assignee set to mkittler
- Status changed from In Progress to Feedback
nicksinger wrote in #note-21:
This also affected users, as can be seen in https://suse.slack.com/archives/C02CANHLANP/p1746528651080959. I manually fixed it by executing `chown geekotest:root /var/lib/openqa/` to allow the process to write into the folder itself. According to user reports, this was enough to make it work. Either we change the path or we have to ensure `/var/lib/openqa/` has the correct permissions and/or owner.
It looks like the effect of this manual fix is gone (probably because a package update reverted the ownership):
martchus@openqa:~> sudo -u geekotest touch /var/lib/openqa/foo
touch: cannot touch '/var/lib/openqa/foo': Permission denied
I created https://github.com/os-autoinst/openQA/pull/6436 to use the webui subdir, which is actually what we want and which is writable:
martchus@openqa:~> sudo -u geekotest touch /var/lib/openqa/foo
touch: cannot touch '/var/lib/openqa/foo': Permission denied
martchus@openqa:~> sudo -u geekotest touch /var/lib/openqa/webui/foo
martchus@openqa:~> echo $?
0
martchus@openqa:~> l /var/lib/openqa/webui/foo
-rw-r--r-- 1 geekotest nogroup 0 May 8 13:11 /var/lib/openqa/webui/foo
It is also already covered by our AppArmor profile.
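The manual `touch` check above could also be scripted, for example as a small pre-flight check. The following is a minimal sketch in Python under the assumption that it runs as the same user as the service (here `geekotest`); the helper name `is_writable_dir` is hypothetical and not part of openQA.

```python
import tempfile


def is_writable_dir(path: str) -> bool:
    """Return True if the current user can create a file inside path.

    A temporary file is actually created and removed again, which mirrors
    the manual `touch` test more closely than only inspecting mode bits.
    """
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            return True
    except OSError:
        # Covers missing directories and permission errors alike.
        return False
```

For example, `is_writable_dir("/var/lib/openqa/webui")` would be expected to return `True` for `geekotest` on the host shown above, while `/var/lib/openqa` would yield `False` as long as its ownership is not changed. Note that this only tests file system permissions; a restriction imposed by AppArmor would only show up if the check runs under the same profile as the service.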
- Status changed from Feedback to Resolved
I checked `select id, state, task, priority, args, created, started, retries from minion_jobs where task = 'git_clone' and state = 'failed' order by created desc limit 100;` and there were no further instances of the problem. So there's nothing to clean up besides the one job left by @tinita, which I have now also deleted.