action #163112
closed
test fails in openqa_webui due to repeated and reproducible errors in reading from the devel:openQA repository "repodata…filelists-ext.xml.gz not found on medium" size:S
Added by okurz 6 months ago.
Updated about 1 month ago.
Category:
Regressions/Crashes
Description
Observation
openQA test in scenario openqa-Tumbleweed-dev-x86_64-openqa_install_nginx@64bit-2G fails in openqa_webui due to repeated and reproducible errors in reading from the devel:openQA repository ("repodata…filelists-ext.xml.gz not found on medium") on the command
retry -e -s 30 -- zypper -n --gpg-auto-import-keys ref
I assume the problem happens when devel:openQA is in the process of being refreshed due to frequent updates in devel:openQA; however, there should be a better way to ensure consistent and, at best, atomic updates of the repo content.
Expected result
Last good: :TW.29599 (or more recent)
Acceptance criteria
- AC1: The latest runs of the scenario pass consistently even if devel:openQA is frequently updated
Suggestions
- We "only" do 3 retries (see the retry sketch after this list)
- Consider how often we retry in other cases and make it consistent
- Keep in mind the script timeout
- Research upstream whether there is a better way to handle this, e.g. look into github.com/openSUSE/zypper/, mailing lists or forums regarding OBS/mirror/zypper behaviour. Also engage with domain experts in corresponding chat channels to find best practices and apply them. According to livdywan she already did all of that, so maybe we need to come up with ideas ourselves.
- Maybe we need to set something cool in the OBS project config to keep older data intact until new repository content is completely available?
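As an illustration of the first two suggestions, a minimal bash sketch of what a longer retry loop around the refresh could look like; the attempt count and sleep value are hypothetical and would have to stay within the overall script timeout:
#!/bin/bash
# Hypothetical sketch, not the current test code: retry the repository refresh
# more often than the current 3 attempts; attempts * sleep_s must stay below
# the surrounding script timeout.
attempts=6   # assumed value
sleep_s=30   # same sleep as the existing "retry -e -s 30" invocation
for ((i = 1; i <= attempts; i++)); do
    zypper -n --gpg-auto-import-keys ref && exit 0
    echo "zypper ref failed (attempt $i/$attempts), retrying in ${sleep_s}s" >&2
    sleep "$sleep_s"
done
exit 1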
Further details
Always latest result in this scenario: latest
- Subject changed from test fails in openqa_webui due to repeated and reproducible errors in reading from the devel:openQA repository "repodata…filelists-ext.xml.gz not found on medium" to test fails in openqa_webui due to repeated and reproducible errors in reading from the devel:openQA repository "repodata…filelists-ext.xml.gz not found on medium" size:S
- Description updated (diff)
- Status changed from New to Workable
- Related to action #162848: webui-docker-compose tests failing on GitHub PR's size:S added
- Tags changed from alert, infra to alert, infra, reactive work
- Project changed from openQA Tests (public) to openQA Project (public)
- Category deleted (Bugs in existing tests)
- Priority changed from Normal to High
- Related to deleted (action #162848: webui-docker-compose tests failing on GitHub PR's size:S)
- Blocks action #162848: webui-docker-compose tests failing on GitHub PR's size:S added
- Category set to Regressions/Crashes
- Blocks deleted (action #162848: webui-docker-compose tests failing on GitHub PR's size:S)
- Status changed from Workable to In Progress
- Assignee set to mkittler
- Related to action #162848: webui-docker-compose tests failing on GitHub PR's size:S added
- Due date set to 2024-07-24
Setting due date based on mean cycle time of SUSE QE Tools
- Status changed from In Progress to Feedback
- Related to action #161729: [sporadic] test fails in containers/build of openqa-in-openqa probably due to temporary download.opensuse.org and zypper issues added
I asked again on our internal channel. I guess there were two main suggestions:
- Add the -vvv flag (zypper -vvv ref openQA) to see any further details.
- Collect something like tail -n 400 /var/log/zypper.log on failure to see whether any mirror is involved (see the sketch below).
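A rough, untested sketch of how both suggestions could be combined; only the -vvv flag and the log path come from the suggestions above, the rest is illustrative:
#!/bin/bash
# Rough sketch: verbose refresh, and on failure dump the end of the zypper log
# so any involved mirror shows up in the collected job logs.
if ! zypper -n -vvv --gpg-auto-import-keys ref; then
    echo "zypper ref failed, collecting zypper log tail" >&2
    tail -n 400 /var/log/zypper.log >&2
    exit 1
fi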
Neither sounds really promising, and the second is also in conflict with AC1 because I would need to remove the retry again. (Otherwise we would probably not be aware of the relevant jobs and would never look into those logs after all.) I guess I'll leave checking the zypper log for when I encounter the issue while updating my local system or one of our servers manually.
But isn't an openQA test the perfect candidate to do this reproduction and log collection? IMHO the issue is more likely to happen when devel:openQA is rebuilt, so consider triggering the tests just after/while devel:openQA content is building, or triggering that recurringly to provoke the issue.
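For example, a simple reproduction loop along those lines could look like this (just a sketch; the repository alias devel_openQA is an assumption and may differ per setup):
#!/bin/bash
# Sketch of a recurring reproduction attempt: refresh only the devel:openQA
# repository in a loop and stop once the transient metadata error shows up.
# The repository alias "devel_openQA" is an assumption and may differ.
while true; do
    out=$(zypper -n ref devel_openQA 2>&1) || {
        echo "$out"
        if grep -q "not found on medium" <<<"$out"; then
            echo "Reproduced the repodata error" >&2
            break
        fi
    }
    sleep 60
done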
- Status changed from Feedback to Workable
- Status changed from Workable to Resolved
I think this is too much effort for this specific and not so frequently occurring problem, especially because those ideas are actually not that promising (they're just the only thing that came to mind).
- Due date deleted (2024-07-24)
- Status changed from Resolved to In Progress
- Assignee changed from mkittler to okurz
- Priority changed from High to Low
OK, interesting. I think I will try to build in some of the mentioned debugging and try some things in openQA tests.
- Due date set to 2024-07-26
- Status changed from In Progress to Feedback
- Due date changed from 2024-07-26 to 2024-12-31
- Target version changed from Ready to Tools - Next
More progress:
(Oliver Kurz) Thx. Yes, we can. https://github.com/os-autoinst/os-autoinst-distri-openQA/pull/195 is now merged. We observed the issue also in many other places, e.g. GitLab CI jobs that we use for automatically deploying openQA and such, but it's easier to reproduce in openQA-in-openQA tests. You stated "deployed a hotpatch and downloadcontent should now be used only for versioned files", so let's see if we hit the problem again at all.
- Related to action #165399: Unable to use openqa-single-instance due to "Valid metadata not found at specified URL" reproducing often size:S added
- Due date deleted (2024-12-31)
- Status changed from Feedback to Resolved
It seems that with the changes to the mirror infrastructure and the retries we have applied on multiple levels, we have not been running into related problems recently.
- Target version changed from Tools - Next to Ready