action #103128: gitlab CI pipelines sporadically fail with "Could not resolve host: gitlab.suse.de", e.g. Recovery pipelines for ARM workers might fail during the maintenance window size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

Added by mkittler over 3 years ago. Updated about 3 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

okurz

Category:

Target version:

openQA Project (public) - Ready

Start date:

2021-11-26

Due date:

% Done:

Estimated time:

Description

Observation¶

For example, the recent recovery of arm-1 failed because of the usual network problems during the maintenance window:

fatal: unable to access 'https://gitlab.suse.de/openqa/grafana-webhook-actions.git/': Could not resolve host: gitlab.suse.de

(https://gitlab.suse.de/openqa/grafana-webhook-actions/-/jobs/711927)

This left the worker unrecovered until the problem became apparent during the OSD deployment the next day. The notification mail about the failed pipeline was visible, however, and it would have been enough to restart the pipeline after the maintenance window, which was simply forgotten. It did not help that no mail was sent for the firing long-term alert.

Note that blindly retriggering the pipeline later would not be ideal: if the worker has already been recovered in the meantime, the retrigger would needlessly power-cycle it.
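
A delayed retrigger could guard against the needless power cycle by probing the worker first. The following is only a minimal sketch of that idea: `is_reachable`, `recover_if_needed`, the TCP probe on port 22 and the `power_cycle` callback are all hypothetical names and assumptions, not part of the existing grafana-webhook-actions pipeline.

```python
import socket
from typing import Callable


def is_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Probing the SSH port is an assumption; any health check would do.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts alike.
        return False


def recover_if_needed(
    host: str,
    power_cycle: Callable[[str], None],
    probe: Callable[[str], bool] = is_reachable,
) -> bool:
    """Power-cycle host only if the probe reports it as down.

    Returns True if a power cycle was triggered, False if it was skipped.
    """
    if probe(host):
        return False
    power_cycle(host)
    return True
```

With such a guard the pipeline could be retried after the maintenance window without side effects on a worker that has already come back on its own.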

Acceptance criteria¶

AC1: The gitlab CI pipeline does not fail while gitlab.suse.de is not accessible (e.g. it is not triggered at all during this time, it is retried, or the problem is worked around)

Suggestions¶

Research if we can prevent triggering the pipeline from grafana during the SUSE IT maintenance window
Research if we can retry until the git repo can be reached again
Currently the long-term alert https://monitor.qa.suse.de/d/1bNU0StZz/automatic-actions?tab=alert&editPanel=5&orgId=1 does not have a notification target configured. Research in our git history why we did not enable this. Maybe we can just enable it and send email to osd-admins@suse.de after we also try to create tickets automatically in the CI pipeline
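
For the first suggestion, the check for "are we inside the maintenance window" could look roughly like the sketch below. The ticket does not state when the SUSE IT maintenance window takes place, so the weekday, start time and duration are hypothetical parameters, and `in_maintenance_window` is an illustrative helper, not existing code. It works on naive datetimes and ignores time zones.

```python
from datetime import datetime, time, timedelta


def in_maintenance_window(
    now: datetime, weekday: int, start: time, duration: timedelta
) -> bool:
    """Return True if `now` lies in a weekly window.

    weekday follows datetime.weekday() (Monday is 0). The window may
    extend past midnight into the following day.
    """
    # Find the most recent window start at or before `now`.
    days_back = (now.weekday() - weekday) % 7
    window_start = datetime.combine(now.date() - timedelta(days=days_back), start)
    if window_start > now:
        window_start -= timedelta(days=7)
    return window_start <= now < window_start + duration
```

A trigger script could call this before starting the pipeline and postpone the action while it returns True.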

Further details¶

The EngInfra ticket regarding the decommissioning of the CAASP cluster is tracked in https://jira.suse.com/browse/ENGINFRA-705

Updated by okurz over 3 years ago

Target version set to Ready

Did you mean "fail" in the subject line?

I assume the code that failed is within gitlab-runner code. Not running any gitlab CI jobs within the suse.de domain should of course prevent running into network problems during that time :) That should be something we can suggest to SUSE IT.

Updated by mkittler over 3 years ago

Subject changed from Recovery pipelines for ARM workers might file during the maintenance window to Recovery pipelines for ARM workers might fail during the maintenance window

Updated by livdywan about 3 years ago

Subject changed from Recovery pipelines for ARM workers might fail during the maintenance window to Recovery pipelines for ARM workers might fail during the maintenance window size:M
Description updated (diff)
Status changed from New to Workable

Updated by okurz about 3 years ago

Status changed from Workable to In Progress
Assignee set to okurz

Apparently there have been two upstream feature requests to retry the initial git clone, https://gitlab.com/gitlab-org/gitlab-runner/-/issues/2296 and https://gitlab.com/gitlab-org/gitlab-docs/-/issues/32, both rejected. So gitlab CI currently does not seem to support retrying the initial git clone out of the box. One option would be to configure gitlab CI not to clone the git repository at all and to do the clone ourselves in the script section with retries and backoff, because the retry in https://docs.gitlab.com/ee/ci/yaml/#retry only applies to the job after the initial git clone. Before we do that we should clarify with EngInfra whether they can find a better solution. I think there is a ticket somewhere regarding the deprecation of the CAASP cluster that is used to run the gitlab CI runners.
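
A manual clone with retries and backoff could be sketched as follows. This is an illustration only, not what was implemented: `retry_with_backoff` and `clone` are hypothetical helpers, the attempt count and delays are arbitrary, and it assumes the runner's own clone has been disabled (GitLab's `GIT_STRATEGY: none` variable is one way to do that) and that Python and git are available in the job image.

```python
import subprocess
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action() until it succeeds or `attempts` calls have failed.

    The wait doubles after every failure: base_delay, 2*base_delay, ...
    The last failure is re-raised so the job still fails visibly.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


def clone(url: str, dest: str) -> None:
    """Shallow-clone url into dest; raises CalledProcessError on failure."""
    subprocess.run(["git", "clone", "--depth", "1", url, dest], check=True)
```

In the job this could be used as `retry_with_backoff(lambda: clone("https://gitlab.suse.de/openqa/grafana-webhook-actions.git", "repo"))`. With the default values the waits add up to 30 seconds, which would not bridge a longer outage, so the values would need tuning to the actual length of the DNS interruptions.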

Updated by okurz about 3 years ago

Status changed from In Progress to Blocked

https://sd.suse.com/servicedesk/customer/portal/1/SD-70900

Updated by okurz about 3 years ago

Description updated (diff)

Updated by okurz about 3 years ago

Subject changed from Recovery pipelines for ARM workers might fail during the maintenance window size:M to gitlab CI pipelines sporadically fail with "Could not resolve host: gitlab.suse.de", e.g. Recovery pipelines for ARM workers might fail during the maintenance window size:M
Status changed from Blocked to Feedback

@team please comment if you see the issue again, with a reference to the corresponding gitlab CI jobs, and provide additional information here and/or in https://sd.suse.com/servicedesk/customer/portal/1/SD-70900

Updated by okurz about 3 years ago

Status changed from Feedback to Blocked

https://sd.suse.com/servicedesk/customer/portal/1/SD-70900 has an update. Jiri Novak found a problem with the DNS config on the kubernetes hosts and is working on that.

Updated by okurz about 3 years ago

Status changed from Blocked to Resolved

The SD ticket was resolved with some fixes to the DNS infrastructure. As Jiri Novak seems to have done a proper test themselves, I consider this resolved. If any of you see this again, please raise it again.
