action #119767

Failed pipeline for "openqa-worker" in salt-states-openqa size:M

Added by dheidler 3 months ago. Updated about 2 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2022-11-02
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1217506

Retrieving: os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm [not found]
Abort, retry, ignore? [a/r/i/...? shows all options] (a): a
[ERROR   ] stderr: File './x86_64/os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/'
Problem occurred during or after installation or removal of packages:
Installation has been aborted as directed.
Please see the above error message for a hint.
[ERROR   ] retcode: 8
[ERROR   ] An error was encountered while installing package(s): Zypper command failure: File './x86_64/os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/'
Problem occurred during or after installation or removal of packages:
Installation has been aborted as directed.
…
          ID: worker.packages
    Function: pkg.installed
      Result: False
     Comment: Attempt 1: Returned a result of "False", with the following comment: "An error was encountered while installing package(s): Zypper command failure: File './x86_64/os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/'
              Problem occurred during or after installation or removal of packages:
              Installation has been aborted as directed.
              Please see the above error message for a hint."
              Attempt 2: Returned a result of "False", with the following comment: "An error was encountered while installing package(s): Zypper command failure: File './x86_64/os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/'
              Problem occurred during or after installation or removal of packages:
              Installation has been aborted as directed.
              Please see the above error message for a hint."
              Attempt 3: Returned a result of "False", with the following comment: "An error was encountered while installing package(s): Zypper command failure: File './x86_64/os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/'
              Problem occurred during or after installation or removal of packages:
              Installation has been aborted as directed.
              Please see the above error message for a hint."
              Attempt 4: Returned a result of "False", with the following comment: "An error was encountered while installing package(s): Zypper command failure: File './x86_64/os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/'
              Problem occurred during or after installation or removal of packages:
              Installation has been aborted as directed.
              Please see the above error message for a hint."
              An error was encountered while installing package(s): Zypper command failure: File './x86_64/os-autoinst-4.6.1666985981.c33e9ef-1421.1.x86_64.rpm' not found on medium 'http://download.opensuse.org/repositories/devel:/openQA/openSUSE_Tumbleweed/'
              Problem occurred during or after installation or removal of packages:
              Installation has been aborted as directed.
              Please see the above error message for a hint.

Retried the pipeline for now: https://gitlab.suse.de/openqa/salt-states-openqa/-/jobs/1222315

Acceptance criteria

  • AC1: Pipeline passes again
  • AC2: It is known why the pipeline failed

Suggestions

  • Read the git history of what changes we applied in the past to the package installations
  • We have already instructed salt to call zypper multiple times as a retry. But it looks like the repository data is not refreshed between calls, so we need to ensure that the refresh is also repeated. In https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L8 we say "refresh: False" to save time, but here it does not help us. So we should check how much longer it takes if we change back to refreshing.
  • Maybe we can find something that only applies the refresh where it's actually necessary, e.g. split the repo statements for devel:openQA and do explicit refresh there, but not in other cases
  • Make sure to comment explicitly why certain things are done, e.g. why we would need a refresh
  • Conduct a simple benchmark to find out the impact of no-refresh vs. refresh vs. salt-default, both in the gitlab CI pipeline and when applying the state
  • According to the salt docs, the salt-default (not specifying refresh) should ensure that a refresh is only done once, but okurz doubts this works, so it needs to be verified, e.g. by checking the salt output at debug log level

History

#1 Updated by dheidler 3 months ago

  • Priority changed from Normal to High

#2 Updated by okurz 3 months ago

  • Target version set to Ready

#3 Updated by mkittler 3 months ago

  • Subject changed from Failed pipeline for "openqa-worker" in salt-states-opensuse to Failed pipeline for "openqa-worker" in salt-states-opensuse size:M
  • Description updated (diff)
  • Status changed from New to Workable

#4 Updated by dheidler 3 months ago

  • Status changed from Workable to In Progress
  • Assignee set to dheidler

This looks like a repo issue or an issue regarding local copy of repo metadata being out of date.
PR as suggested: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/765

#5 Updated by dheidler 3 months ago

  • Status changed from In Progress to Feedback

#6 Updated by okurz 3 months ago

  • Due date set to 2022-11-18

#7 Updated by cdywan 3 months ago

dheidler wrote:

This looks like a repo issue or an issue regarding local copy of repo metadata being out of date.
PR as suggested: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/765

This is still under review. Might be worth discussing with others since I feel like Dominik was expecting a more trivial fix.

#8 Updated by okurz 3 months ago

  • Subject changed from Failed pipeline for "openqa-worker" in salt-states-opensuse size:M to Failed pipeline for "openqa-worker" in salt-states-opensuse
  • Due date deleted (2022-11-18)
  • Status changed from Feedback to New
  • Assignee deleted (dheidler)

cdywan wrote:

dheidler wrote:

This looks like a repo issue or an issue regarding local copy of repo metadata being out of date.
PR as suggested: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/765

This is still under review. Might be worth discussing with others since I feel like Dominik was expecting a more trivial fix.

Then we need to rediscuss although I think the original ticket description already covers it:

We have already instructed salt to call zypper multiple times as a retry. But it looks like the repository data is not refreshed between calls, so we need to ensure that the refresh is also repeated. In https://gitlab.suse.de/openqa/salt-states-openqa/-/blob/master/openqa/server.sls#L8 we say "refresh: False" to save time, but here it does not help us. So we should check how much longer it takes if we change back to refreshing.

meaning: It's not as simple as just putting "refresh: True" there. Also it wouldn't be "size:M" if it were just that, right?

#9 Updated by okurz 3 months ago

  • Subject changed from Failed pipeline for "openqa-worker" in salt-states-opensuse to Failed pipeline for "openqa-worker" in salt-states-opensuse size:M
  • Description updated (diff)
  • Status changed from New to Workable

#10 Updated by mkittler 2 months ago

  • Assignee set to mkittler

#11 Updated by mkittler 2 months ago

According to the documentation https://docs.saltproject.io/en/latest/ref/states/all/salt.states.pkg.html, using refresh: True will slow us down because we have multiple pkg states and a refresh would then be done for each of them. I can nevertheless create an MR to see how bad it'll be. Keeping Salt's default might not be helpful: the documentation at least doesn't state that a refresh would be done when a state is retried. Neither the mentioned documentation nor https://docs.saltproject.io/en/latest/ref/states/requisites.html#retrying-states describes the interaction between refresh and retry. I'm also not sure how we would test the behavior ourselves. We'd somehow need to provoke the error and trace whether a refresh is done.

#12 Updated by mkittler 2 months ago

MR: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/776

CI runtimes on master (with refresh: False):

  • test-storage: 00:02:46
  • test-monitor: 00:03:38
  • test-worker: 00:12:30
  • test-webui: 00:05:28

CI runtimes with refresh: True:

  • test-storage: 00:03:37
  • test-monitor: 00:05:14
  • test-worker: 00:12:30
  • test-webui: 00:08:10

So it generally takes up to about three minutes longer (test-storage +0:51, test-monitor +1:36, test-webui +2:42). Strangely, test-worker had the same runtime. Not sure whether that's acceptable.
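
To make the comparison easier to read, the per-job difference can be computed from the two lists above. This is only a minimal sketch: the helper names (parse_hms, runtime_deltas) are made up for illustration, and the numbers are copied from the single CI run per configuration reported in this comment, so they say nothing about run-to-run variance.

```python
from datetime import timedelta

# Runtimes copied from the lists above: (refresh: False on master, refresh: True).
RUNTIMES = {
    "test-storage": ("00:02:46", "00:03:37"),
    "test-monitor": ("00:03:38", "00:05:14"),
    "test-worker": ("00:12:30", "00:12:30"),
    "test-webui": ("00:05:28", "00:08:10"),
}


def parse_hms(text: str) -> timedelta:
    """Parse an HH:MM:SS string as shown in the GitLab job list."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def runtime_deltas(runtimes: dict) -> dict:
    """Return the extra runtime per job when refresh is enabled."""
    return {
        job: parse_hms(with_refresh) - parse_hms(without_refresh)
        for job, (without_refresh, with_refresh) in runtimes.items()
    }


if __name__ == "__main__":
    for job, delta in runtime_deltas(RUNTIMES).items():
        print(f"{job}: +{int(delta.total_seconds())} s")
```

With a single sample per job, the unchanged test-worker runtime could just as well be noise or caching, so repeating the measurement a few times would be needed before drawing conclusions.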

#13 Updated by mkittler 2 months ago

  • Assignee deleted (mkittler)

I currently have enough tickets assigned. Maybe I'll pick this one up later. It would also make sense to discuss the outcome of my test (mentioned in the previous comment).

#14 Updated by dheidler 2 months ago

  • Status changed from Workable to Feedback
  • Assignee set to dheidler

I personally would consider everything below 15 minutes acceptable, especially as it saves us time reacting to issues.

So I would go for merging this.
Any objections?

#15 Updated by okurz 2 months ago

Well, as mkittler's test shows, the runtime does increase, though not for the worker. However, the additional time is spent not only during CI runs but any time someone or a service applies a salt high state, which I consider significant. As we need to do some retrying anyway, I would prefer a more efficient solution that tries the fastest way first and only refreshes in retries as necessary.
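
The control flow meant here ("fastest way first, refresh only on retry") could look roughly like the following. This is a generic sketch, not Salt functionality: whether Salt's retry mechanism can express it is exactly the open question in this ticket. The function name and the two callables are hypothetical placeholders for whatever would actually install packages and refresh the repository metadata.

```python
from typing import Callable


def install_with_refresh_on_retry(
    install: Callable[[], bool],
    refresh: Callable[[], None],
    attempts: int = 3,
) -> bool:
    """Try a fast install first and refresh metadata only before retries.

    `install` returns True on success; `refresh` updates the repository
    metadata. Both are placeholders supplied by the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        if attempt > 0:
            # Only pay the cost of a refresh after a failed attempt.
            refresh()
        if install():
            return True
    return False
```

In the common case the first attempt succeeds and no refresh happens at all, so the normal high state stays as fast as with "refresh: False". The refresh cost is only paid when stale metadata (as in the failed job above) makes the first attempt fail.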

#16 Updated by dheidler 2 months ago

Hm - we could set retry to true maybe with some env var that is only set when the pipeline is applied from gitlab. WDYT?
I don't know of any way to achieve your idea using salt (or even whether it is possible).

#17 Updated by okurz 2 months ago

dheidler wrote:

Hm - we could set retry to true maybe with some env var that is only set when the pipeline is applied from gitlab. WDYT?
I don't know of any way to achieve your idea using salt (or even whether it is possible).

This gave me an idea: if we effectively only want to "retry" when running in CI jobs, then let's do that, not with a "refresh" but simply with a CI-level retry:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/778

#18 Updated by mkittler 2 months ago

  • Subject changed from Failed pipeline for "openqa-worker" in salt-states-opensuse size:M to Failed pipeline for "openqa-worker" in salt-states-openqa size:M

#19 Updated by dheidler about 2 months ago

  • Status changed from Feedback to Resolved
