action #179038

closed

coordination #154777: [saga][epic] Shareable os-autoinst and test distribution plugins

coordination #162131: [epic] future version control related features in openQA

Gracious handling of longer remote git clones outages size:S

Added by robert.richardson 2 months ago. Updated 13 days ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2025-03-17
Due date:
% Done: 0%
Estimated time:

Description

Motivation

Currently, git_clone Minion jobs fail when GitLab is temporarily unreachable (see #178492). By introducing a proper error-handling mechanism, we can ensure:

  • Temporary outages do not cause unnecessary job failures or alerts.

User Story

As a test engineer and openQA operator,
I want openQA to handle short-lived GitLab outages without causing mass Minion job failures,
so that users do not experience unnecessary disruption.

Acceptance Criteria

  • AC1: Temporary remote git outages don't cause failing minion jobs
  • AC2: Remote git repositories are still updated after short failed requests, e.g. outages in the range of seconds

Suggestions

  • Damage is likely limited. If we can't sync needles nobody can edit needles.
  • Jobs end up incomplete if there's an on-going issue with git_clone minion jobs
  • We could decide to eventually give up and continue anyway and let jobs run
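One way to picture the "tolerate short outages, eventually give up" idea from the suggestions above is a small flag-file check. This is only a minimal sketch under stated assumptions: the function names `outage_verdict` and `clear_outage`, the grace period argument, and the flag file content (an epoch timestamp) are all hypothetical and do not describe what openQA's `OpenQA::Git::ServerAvailability` actually does.

```shell
#!/bin/sh
# Hypothetical sketch, not openQA's implementation.
# Decide how to treat a failed remote git request.
#   $1 = path of the outage flag file (assumed to hold the outage start as epoch seconds)
#   $2 = grace period in seconds
#   $3 = current time as epoch seconds (optional, defaults to now; handy for testing)
# Prints "skip" while the outage is younger than the grace period, "fail" afterwards.
outage_verdict() {
    flag=$1
    grace=$2
    now=${3:-$(date +%s)}
    if [ ! -f "$flag" ]; then
        # First failure: remember when the outage started and tolerate it.
        printf '%s\n' "$now" > "$flag"
        echo skip
        return 0
    fi
    start=$(cat "$flag")
    if [ $((now - start)) -lt "$grace" ]; then
        echo skip
    else
        echo fail
    fi
}

# After a successful request the flag is removed so the next outage starts fresh.
clear_outage() {
    rm -f "$1"
}
```

A caller would run `outage_verdict` only after a remote request has failed and call `clear_outage` after the next success. Whether "fail" should mean a failing Minion job or "continue anyway and let jobs run" is exactly the open design question raised in the suggestions.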

Related issues 4 (1 open, 3 closed)

Related to openQA Infrastructure (public) - action #178492: [alert] Many failing `git_clone` Minion jobs auto_review:"Error detecting remote default branch name":retry size:S (Resolved, robert.richardson, 2025-03-07)

Related to openQA Infrastructure (public) - action #182021: [alert] web UI: Too many Minion job failures alert (Resolved, tinita, 2025-01-23)

Copied to openQA Project (public) - action #179185: Detection of long-time remote git clone outages size:S (Workable, 2025-03-17)

Copied to openQA Project (public) - action #180863: Conduct lessons learned "Five Why" analysis for "Gracious handling of longer remote git clones outages" size:S (Resolved, livdywan)

Actions #1

Updated by robert.richardson 2 months ago

  • Description updated (diff)
Actions #2

Updated by robert.richardson 2 months ago

  • Related to action #178492: [alert] Many failing `git_clone` Minion jobs auto_review:"Error detecting remote default branch name":retry size:S added
Actions #3

Updated by robert.richardson 2 months ago

  • Description updated (diff)
Actions #4

Updated by okurz 2 months ago

  • Target version changed from Ready to Tools - Next
Actions #5

Updated by okurz 2 months ago

  • Subject changed from Improve GitLab Outage Handling in openQA to Gracious handling of longer remote git clones outages size:S
  • Description updated (diff)
Actions #6

Updated by okurz 2 months ago

  • Parent task set to #162131
Actions #7

Updated by okurz 2 months ago

  • Copied to action #179185: Detection of long-time remote git clone outages size:S added
Actions #8

Updated by okurz 2 months ago

  • Status changed from New to Workable
  • Target version changed from Tools - Next to Ready

Needed for #178492

Actions #9

Updated by robert.richardson 2 months ago

  • Assignee set to robert.richardson
Actions #10

Updated by robert.richardson 2 months ago

  • Status changed from Workable to In Progress
Actions #11

Updated by openqa_review 2 months ago

  • Due date set to 2025-04-08

Setting due date based on mean cycle time of SUSE QE Tools

Actions #12

Updated by livdywan about 2 months ago

  • Status changed from In Progress to Workable
Actions #13

Updated by livdywan about 2 months ago

  • Due date deleted (2025-04-08)
Actions #14

Updated by livdywan about 2 months ago

  • Due date set to 2025-04-11
  • Status changed from Workable to In Progress

robert.richardson wrote in #note-10:

WIP Pull Request

Being discussed and reviewed

Actions #15

Updated by robert.richardson about 1 month ago

  • Status changed from In Progress to Feedback
Actions #16

Updated by livdywan about 1 month ago

  • Copied to action #180863: Conduct lessons learned "Five Why" analysis for "Gracious handling of longer remote git clones outages" size:S added
Actions #17

Updated by livdywan about 1 month ago

  • Status changed from Feedback to Resolved

livdywan wrote in #note-14:

robert.richardson wrote in #note-10:

WIP Pull Request

Being discussed and reviewed

Merged 🤞🥳

Actions #18

Updated by okurz about 1 month ago

  • Due date deleted (2025-04-11)
Actions #19

Updated by tinita 19 days ago

  • Status changed from Resolved to Workable
  • Priority changed from Normal to High

https://openqa.suse.de/minion/jobs?id=15460188

result: |
  Can't open file "/var/lib/openqa/git_server_outage.gitlab.suse.de.flag": Permission denied at /usr/share/openqa/script/../lib/OpenQA/Git/ServerAvailability.pm line 33.
Actions #20

Updated by okurz 19 days ago

  • Assignee deleted (robert.richardson)
Actions #21

Updated by nicksinger 19 days ago

This also affected users, as can be seen in https://suse.slack.com/archives/C02CANHLANP/p1746528651080959. I manually fixed it by executing chown geekotest:root /var/lib/openqa/ to allow the process to write into the folder itself. According to user reports, this was enough to make it work. Either we change the path, or we have to ensure /var/lib/openqa/ has the correct permissions and/or owner.

Actions #22

Updated by tinita 17 days ago

  • Related to action #182021: [alert] web UI: Too many Minion job failures alert added
Actions #23

Updated by tinita 17 days ago

In order to avoid an alert, I deleted the failing git_clone tasks except for the last one linked here.

Actions #24

Updated by mkittler 17 days ago

  • Status changed from Workable to In Progress
  • Assignee set to mkittler
Actions #25

Updated by mkittler 17 days ago

  • Status changed from In Progress to Feedback

nicksinger wrote in #note-21:

This also affected users, as can be seen in https://suse.slack.com/archives/C02CANHLANP/p1746528651080959. I manually fixed it by executing chown geekotest:root /var/lib/openqa/ to allow the process to write into the folder itself. According to user reports, this was enough to make it work. Either we change the path, or we have to ensure /var/lib/openqa/ has the correct permissions and/or owner.

It looks like the effect of this manual fix is gone (probably because a package update reverted the ownership):

martchus@openqa:~> sudo -u geekotest touch /var/lib/openqa/foo
touch: cannot touch '/var/lib/openqa/foo': Permission denied

I created https://github.com/os-autoinst/openQA/pull/6436 to use the webui subdirectory, which is what we actually want and which is writable:

martchus@openqa:~> sudo -u geekotest touch /var/lib/openqa/foo
touch: cannot touch '/var/lib/openqa/foo': Permission denied
martchus@openqa:~> sudo -u geekotest touch /var/lib/openqa/webui/foo
martchus@openqa:~> echo $?
0
martchus@openqa:~> l /var/lib/openqa/webui/foo
-rw-r--r-- 1 geekotest nogroup 0 May  8 13:11 /var/lib/openqa/webui/foo

It is also already covered by our AppArmor profile.
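The manual touch test above can also be expressed as a small reusable check. The following is a minimal sketch, assuming a hypothetical helper named `pick_writable_dir` that is not part of openQA; it prints the first candidate directory the current user can write to.

```shell
#!/bin/sh
# Hypothetical helper (not part of openQA): print the first candidate
# directory that exists and is writable by the current user.
# Returns 1 if none of the candidates qualifies.
pick_writable_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}
```

To reproduce the check from this comment, the function would have to run as the service user, for example inside a shell started with sudo -u geekotest, and be called with /var/lib/openqa/webui and /var/lib/openqa as candidates. Note that the `-w` test only reflects file permissions; it says nothing about restrictions from an AppArmor profile.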

Actions #26

Updated by mkittler 13 days ago

  • Status changed from Feedback to Resolved

I checked select id, state, task, priority, args, created, started, retries from minion_jobs where task = 'git_clone' and state = 'failed' order by created desc limit 100; and there were no further instances of the problem. So there's nothing to clean up besides the one job left by @tinita, which I have now also deleted.
