action #119767: Failed pipeline for "openqa-worker" in salt-states-openqa size:M - openQA Infrastructure (public) - openSUSE Project Management Tool

action #119767

closed

Failed pipeline for "openqa-worker" in salt-states-openqa size:M

Added by dheidler about 2 years ago. Updated about 2 years ago.

Status: Resolved
Priority: High
Assignee: dheidler
Category:
Target version: openQA Project (public) - Ready
Start date: 2022-11-02
Due date:
% Done:
Estimated time:
Tags: reactive work

Description

Observation

https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1217506

Retrieving: os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm [not found]
Abort, retry, ignore? [a/r/i/...? shows all options] (a): a
[ERROR   ] stderr: File './x86_64/os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/'
Problem occurred during or after installation or removal of packages:
Installation has been aborted as directed.
Please see the above error message for a hint.
[ERROR   ] retcode: 8
[ERROR   ] An error was encountered while installing package(s): Zypper command failure: File './x86_64/os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/'
Problem occurred during or after installation or removal of packages:
Installation has been aborted as directed.
…
          ID: worker.packages
    Function: pkg.installed
      Result: False
     Comment: Attempt 1: Returned a result of "False", with the following comment: "An error was encountered while installing package(s): Zypper command failure: File './x86_64/os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/'
              Problem occurred during or after installation or removal of packages:
              Installation has been aborted as directed.
              Please see the above error message for a hint."
              Attempt 2: Returned a result of "False", with the following comment: "An error was encountered while installing package(s): Zypper command failure: File './x86_64/os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/'
              Problem occurred during or after installation or removal of packages:
              Installation has been aborted as directed.
              Please see the above error message for a hint."
              Attempt 3: Returned a result of "False", with the following comment: "An error was encountered while installing package(s): Zypper command failure: File './x86_64/os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/'
              Problem occurred during or after installation or removal of packages:
              Installation has been aborted as directed.
              Please see the above error message for a hint."
              Attempt 4: Returned a result of "False", with the following comment: "An error was encountered while installing package(s): Zypper command failure: File './x86_64/os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/'
              Problem occurred during or after installation or removal of packages:
              Installation has been aborted as directed.
              Please see the above error message for a hint."
              An error was encountered while installing package(s): Zypper command failure: File './x86_64/os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/'
              Problem occurred during or after installation or removal of packages:
              Installation has been aborted as directed.
              Please see the above error message for a hint.

Retried the pipeline for now: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1222315

Acceptance criteria

AC1: Pipeline passes again
AC2: It is known why the pipeline failed

Suggestions

Read the git history of what changes we applied in the past to the package installations
We have already instructed salt to call zypper multiple times for retries, but it looks like the repository data is not refreshed between calls. So we need to ensure that the refresh is also done multiple times. In https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L8 we say "refresh: False" to save time, but here it does not help us. So we should check how long it takes in comparison if we change back to refreshing.
Maybe we can find something that only applies the refresh where it's actually necessary, e.g. split the repo statements for devel:openQA and do explicit refresh there, but not in other cases
Make sure to comment explicitly why certain things are done, e.g. why we would need a refresh
Conduct a simple benchmark to find out the impact of no-refresh vs. refresh vs. salt-default on the GitLab CI pipeline and on applying the state
According to the salt docs, the salt default (not specifying refresh) should ensure that a refresh is only done once, but okurz doubts this works, so it needs to be verified, e.g. by checking salt output at debug log level
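
The suspected failure mode from the suggestions above can be illustrated with a toy model. This is only a sketch, not salt or zypper code: the function name `simulate` and the file names are made up for illustration, and the model simply assumes that the cached metadata points at a build that the repository no longer serves.

```python
def simulate(attempts, refresh_between):
    """Toy model of retrying an install against stale repository metadata.

    Returns the outcome of each attempt. All names are hypothetical.
    """
    # The repository only serves the newest build ...
    repo_files = {"os-autoinst-2.rpm"}
    # ... while the locally cached metadata still names the previous one.
    cached = {"os-autoinst": "os-autoinst-1.rpm"}

    results = []
    for attempt in range(attempts):
        if refresh_between and attempt > 0:
            # What a metadata refresh would fetch from the repository.
            cached["os-autoinst"] = "os-autoinst-2.rpm"
        ok = cached["os-autoinst"] in repo_files
        results.append(ok)
        if ok:
            break
    return results


print(simulate(4, refresh_between=False))  # every attempt fails the same way
print(simulate(4, refresh_between=True))   # the first retry after a refresh succeeds
```

Under these assumptions, retrying without a refresh cannot succeed, which matches the four identical "not found on medium" attempts in the log above.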

Updated by dheidler about 2 years ago

Priority changed from Normal to High

Updated by okurz about 2 years ago

Target version set to Ready

Updated by mkittler about 2 years ago

Subject changed from Failed pipeline for "openqa-worker" in salt-states-opensuse to Failed pipeline for "openqa-worker" in salt-states-opensuse size:M
Description updated (diff)
Status changed from New to Workable

Updated by dheidler about 2 years ago

Status changed from Workable to In Progress
Assignee set to dheidler

This looks like a repo issue or an issue with the local copy of the repo metadata being out of date.
PR as suggested: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/765

Updated by dheidler about 2 years ago

Status changed from In Progress to Feedback

Updated by okurz about 2 years ago

Due date set to 2022-11-18

Updated by livdywan about 2 years ago

dheidler wrote:

This looks like a repo issue or an issue with the local copy of the repo metadata being out of date.
PR as suggested: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/765

This is still under review. Might be worth discussing with others since I feel like Dominik was expecting a more trivial fix.

Updated by okurz about 2 years ago

Subject changed from Failed pipeline for "openqa-worker" in salt-states-opensuse size:M to Failed pipeline for "openqa-worker" in salt-states-opensuse
Due date deleted (~~2022-11-18~~)
Status changed from Feedback to New
Assignee deleted (~~dheidler~~)

cdywan wrote:

dheidler wrote:

This looks like a repo issue or an issue with the local copy of the repo metadata being out of date.
PR as suggested: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/765

This is still under review. Might be worth discussing with others since I feel like Dominik was expecting a more trivial fix.

Then we need to rediscuss although I think the original ticket description already covers it:

We have already instructed salt to call zypper multiple times for retries, but it looks like the repository data is not refreshed between calls. So we need to ensure that the refresh is also done multiple times. In https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L8 we say "refresh: False" to save time, but here it does not help us. So we should check how long it takes in comparison if we change back to refreshing.

meaning: It's not as simple as just putting "refresh: True" there. Also it wouldn't be "size:M" if it's just that, right?

Updated by okurz about 2 years ago

Subject changed from Failed pipeline for "openqa-worker" in salt-states-opensuse to Failed pipeline for "openqa-worker" in salt-states-opensuse size:M
Description updated (diff)
Status changed from New to Workable

#10

Updated by mkittler about 2 years ago

Assignee set to mkittler

#11

Updated by mkittler about 2 years ago

According to the documentation https://docs.saltproject.io/en/latest/ref/states/all/salt.states.pkg.html, using refresh: True will slow us down: we have multiple pkg states, and a refresh would then be done for each of them. I can nevertheless create an MR to see how bad it will be. Keeping Salt's default might not be helpful. At least the documentation doesn't state that a refresh would be done when a state is retried. Neither the mentioned documentation nor https://docs.saltproject.io/en/latest/ref/states/requisites.html#retrying-states describes the interaction between refresh and retry. I'm also not sure how we would test the behavior ourselves. We would somehow need to provoke the error and trace whether a refresh is done.

#12

Updated by mkittler about 2 years ago

MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/776

CI runtimes on master (with refresh: False):

test-storage: 00:02:46
test-monitor: 00:03:38
test-worker: 00:12:30
test-webui: 00:05:28

CI runtimes with refresh: True:

test-storage: 00:03:37
test-monitor: 00:05:14
test-worker: 00:12:30
test-webui: 00:08:10

So three of the four jobs take between just under one minute and almost three minutes longer (51 s, 1 min 36 s and 2 min 42 s). Strangely, test-worker had the same runtime. Not sure whether the increase is acceptable.
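
For reference, the per-job differences between the two sets of CI runtimes listed above can be computed with a few lines of Python. This is just a convenience sketch; the helper names `parse` and `deltas` are made up here, and the numbers are copied from the two lists in this comment.

```python
from datetime import timedelta


def parse(hms):
    """Convert an HH:MM:SS string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


# (runtime with refresh: False, runtime with refresh: True), as reported above
RUNTIMES = {
    "test-storage": ("00:02:46", "00:03:37"),
    "test-monitor": ("00:03:38", "00:05:14"),
    "test-worker": ("00:12:30", "00:12:30"),
    "test-webui": ("00:05:28", "00:08:10"),
}


def deltas(runtimes):
    """Return the extra seconds per job when refresh is enabled."""
    return {
        job: int((parse(after) - parse(before)).total_seconds())
        for job, (before, after) in runtimes.items()
    }


for job, extra in deltas(RUNTIMES).items():
    print(f"{job}: +{extra} s")
```

This gives +51 s for test-storage, +96 s for test-monitor, +0 s for test-worker and +162 s for test-webui, based on a single run per configuration.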

#13

Updated by mkittler about 2 years ago

Assignee deleted (~~mkittler~~)

I currently have enough tickets assigned. Maybe I'll pick this one up later. It would also make sense to discuss the outcome of my test (mentioned in the previous comment).

#14

Updated by dheidler about 2 years ago

Status changed from Workable to Feedback
Assignee set to dheidler

I personally would consider everything below 15 minutes acceptable, especially as it saves us time reacting to issues.

So I would go for merging this.
Any objections?

#15

Updated by okurz about 2 years ago

Well, as mkittler's test showed, the runtime does increase, but not for the worker. However, the additional time is incurred not only during CI runs but any time someone or a service applies a salt high state, which I consider significant. As we need to do some retrying anyway, I would favor finding a more efficient solution that tries the fastest way first and only refreshes in retries as necessary.
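
To make the idea concrete, here is a minimal sketch of "fastest way first, refresh only on retry" as plain control flow. It is not salt code, and whether salt's retry mechanism can express this is not established in this ticket. The function name `install_fast_then_refresh` and the injected `install` and `refresh` callables are hypothetical stand-ins for the real package operations.

```python
from typing import Callable


def install_fast_then_refresh(
    install: Callable[[], bool],
    refresh: Callable[[], None],
    max_attempts: int = 3,
) -> int:
    """Try to install without refreshing; refresh only before each retry.

    Returns the number of refreshes that were needed and raises
    RuntimeError if every attempt fails.
    """
    refreshes = 0
    for attempt in range(max_attempts):
        if attempt > 0:
            # Only pay the refresh cost after a failed attempt.
            refresh()
            refreshes += 1
        if install():
            return refreshes
    raise RuntimeError(f"install failed after {max_attempts} attempts")


# Usage with stubs: the install only works once the metadata is fresh.
state = {"fresh": False}
needed = install_fast_then_refresh(
    install=lambda: state["fresh"],
    refresh=lambda: state.update(fresh=True),
)
print(needed)  # number of refreshes performed
```

In the common case where the first attempt succeeds, no refresh happens at all, so the happy path stays as fast as with "refresh: False".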

#16

Updated by dheidler about 2 years ago

Hm - we could set retry to true maybe with some env var that is only set when the pipeline is applied from gitlab. WDYT?
I don't know any way how (or even if) your idea could be achieved using salt.

#17

Updated by okurz about 2 years ago

dheidler wrote:

Hm - we could set retry to true maybe with some env var that is only set when the pipeline is applied from gitlab. WDYT?
I don't know any way how (or even if) your idea could be achieved using salt.

This gave me an idea: if we only want to effectively "retry" when running in CI jobs, then let's do that, not with "refresh" but simply with a CI-level retry:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/778

#18

Updated by mkittler about 2 years ago

Subject changed from Failed pipeline for "openqa-worker" in salt-states-opensuse size:M to Failed pipeline for "openqa-worker" in salt-states-openqa size:M

#19

Updated by dheidler about 2 years ago

Status changed from Feedback to Resolved

Let's see if it happens again:
https://gitlab.suse.de/openqa/salt-states-openqa/-/pipelines/545047
