action #151138
closed [openQA][aarch64][media] 15-SP6 Build39.1 media for aarch64 cleaned up while suse.asia instance was still using it size:S
Description
Observation
The test run failed because the media cannot be found: https://openqa.suse.de/assets/repo/SLE-15-SP6-Full-aarch64-Build39.1-Media1/
Steps to reproduce
- Check the latest 15-SP6 media Build39.1
Impact
All aarch64 test runs failed, for example:
test1
test2
test3
Problem
15-SP6 Build39.1 aarch64 full media does not exist on OSD:
https://openqa.suse.de/assets/repo/SLE-15-SP6-Full-aarch64-Build39.1-Media1/
Suggestions
- Check media availability from https://openqa.suse.de/admin/assets as well as locally under /var/lib/openqa/share/factory/repo/SLE-15-SP6-Full-aarch64-Build39.1-Media1/ and other builds for the same product
- Check journal of openqa-gru service for cleanup of "SLE-15-SP6-Full-aarch64-Build39.1-Media1"
- Check available storage on OSD. If there is spare space, the issue may just be a symptom of job group size limits being configured too small.
- Ensure that the reporter and the team both understand that relying on openQA assets from any external service is a bad idea
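The local check from the first suggestion can be sketched in Python. This is only a minimal sketch: the URL prefix and the directory are the ones quoted in this ticket, while the helper names `repo_locations` and `local_repo_state` are made up for illustration and are not part of openQA.

```python
import os

# Both values are taken from the ticket text; adjust for other instances.
OSD_ASSET_BASE_URL = "https://openqa.suse.de/assets/repo"
LOCAL_REPO_DIR = "/var/lib/openqa/share/factory/repo"


def repo_locations(asset_name, base_url=OSD_ASSET_BASE_URL, local_dir=LOCAL_REPO_DIR):
    """Return (web URL, local path) for a repo asset name (hypothetical helper)."""
    return f"{base_url}/{asset_name}/", os.path.join(local_dir, asset_name)


def local_repo_state(path):
    """Classify a local repo directory as 'missing', 'empty' or 'populated'."""
    if not os.path.isdir(path):
        return "missing"
    with os.scandir(path) as entries:
        # Any directory entry at all means the repo still has content.
        return "populated" if any(entries) else "empty"


if __name__ == "__main__":
    url, path = repo_locations("SLE-15-SP6-Full-aarch64-Build39.1-Media1")
    print(url)
    print(path, local_repo_state(path))
```

Run on the OSD host, this would distinguish a directory that was removed entirely from one that exists but is empty, which is the state the reporter describes.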
Workaround
n/a
Out of scope
Fix openqa.qa2.suse.asia to always have access to such assets from OSD
Updated by waynechen55 12 months ago
aarch64 Beta1 testing cannot be done without this media.
Updated by tinita 12 months ago
@waynechen55 your links are pointing to openqa.qa2.suse.asia, e.g. http://openqa.qa2.suse.asia/tests/65261#step/guest_installation_run/21
And your links to the assets are pointing to openqa.suse.de.
We're not sure how that is connected to openqa.suse.de.
Btw, http://openqa.qa2.suse.asia/changelog was last updated on Sep 11.
Updated by waynechen55 12 months ago
The test suite runs on openqa.qa2.suse.asia, but it uses the full media on OSD for each new build. So for Build39.1, the media https://openqa.suse.de/assets/repo/SLE-15-SP6-Full-aarch64-Build39.1-Media1/ is needed. I do not think this is news, because it has always been like this, dating back to 15-SPx.
Updated by livdywan 12 months ago
I'll re-phrase this a little. We as a team aren't aware of this setup and it looks like just another instance we don't maintain. So if you need help we need to know how it's supposed to work.
So I suggest
- Please make sure the instance is up to date so that it's not affected by known bugs
- Clarify how assets are synced from OSD
Then we can confirm whether there's an openQA bug here, or what the issue with the asset handling is.
Updated by waynechen55 12 months ago · Edited
- There is an openQA instance in Beijing, http://openqa.qa2.suse.asia/, which is maintained locally, so it is updated regularly.
- Take this test http://openqa.qa2.suse.asia/tests/65261 as an example. It runs on openqa.qa2.suse.asia and installs a virtual machine at the final step. The installation uses the full media for Build39.1, namely https://openqa.suse.de/assets/repo/SLE-15-SP6-Full-aarch64-Build39.1-Media1 on OSD. But the media is empty, which leads to the test run failure. So my request is to make sure the mounted media is not empty.
- Actually, the test run with the previous Build37.1 went very well. The corresponding media https://openqa.suse.de/assets/repo/SLE-15-SP6-Full-aarch64-Build37.1-Media1/ exists on OSD. But the corresponding media for Build39.1 is empty.
- So if the Build39.1 media is empty, it may indicate an OSD problem with storage or syncing which needs to be solved by you or someone else.
Updated by openqa_review 12 months ago
- Due date set to 2023-12-05
Setting due date based on mean cycle time of SUSE QE Tools
Updated by waynechen55 12 months ago
@livdywan Can you help fix this ASAP? If not, what prevents you from proceeding?
Updated by livdywan 12 months ago
waynechen55 wrote in #note-12:
@livdywan Can you help fix this ASAP? If not, what prevents you from proceeding?
Maybe you overlooked my question. I'm not clear how the sync happens. If OSD is not aware of those jobs you're running it might be cleaning up the assets because I don't think we have any logic that checks what is being used on other instances.
A possible work-around could be to increase storage limits in affected groups. Or maybe you need to ensure the assets are always present on the same instance. I don't know why or if that's not already the case.
Updated by waynechen55 12 months ago · Edited
- Build39.1 came out not long ago, and the test suites on the Beijing openQA instance are triggered only hours after its delivery, so I think OSD cleaned it up for some other reason, for example a storage limit.
- Whether or not OSD is aware of the Beijing openQA instance should not matter here, because the time between the Build39.1 delivery and my spotting the issue was too short. I do not think it was long enough for OSD to clean up the asset. It looks more like another cause, for example a storage limit. Maybe you can tell me how long OSD waits before cleaning up an asset that is not being used.
- I am not sure about the "sync" between OSD and the Beijing openQA instance. If what I said in the second item is true, then it is not a "sync" issue anyway. If it is not true, I will try to confirm whether there is a "sync", which may indicate a bug in OSD. So, back to the first item: do you think it is true?
- Additionally, Build37.1 still exists and was not cleaned up by OSD, although the test runs with Build37.1 finished almost a week ago. But the newer Build39.1 was cleaned up, so it looks even more obvious that this has nothing to do with "sync". OSD must have cleaned up the newer Build39.1 for some other reason.
Updated by livdywan 12 months ago
waynechen55 wrote in #note-16:
I recall that I once opened a similar issue, #119215, which was fixed and resolved by @okurz. It is also about full media for aarch64. @livdywan, you can have a look as well. It might be helpful.
So I just took a look at the product log and it shows that 4 days ago geekotest scheduled SLE 15-SP6 Full aarch64 39.1 SLE-15-SP6-Full-aarch64-Build39.1-Media1.iso. The first scheduled job is https://openqa.suse.de/tests/12836935, which ran successfully and shows https://openqa.suse.de/tests/12836935/asset/iso/SLE-15-SP6-Full-aarch64-Build39.1-Media1.iso ~10 hours earlier than the failing jobs you linked. The SLE 15 job group currently has no size limit, the YAST job group has 240GB. The used size looks to be identical to the limit so I would assume things have been cleaned up too fast. As I mentioned before, short of having any sort of sync you can increase the limit.
I hope that's clearer. I don't know that I can suggest much else at this point.
Updated by waynechen55 12 months ago
@livdywan When do you think we can have the full media for Build39.1 back? Tomorrow or next week?
Updated by livdywan 12 months ago
waynechen55 wrote in #note-18:
@livdywan When do you think we can have the full media for Build39.1 back? Tomorrow or next week?
Publishing images again is outside of the Tools team's scope. I'm happy to be part of a public Slack thread, but otherwise don't know what to offer besides the points we discussed.
Updated by waynechen55 12 months ago
I hope there is a more robust, persistent solution than increasing the storage limit every time and praying for the next build.
Updated by waynechen55 12 months ago
By the way, the ISO media is still there: https://openqa.suse.de/assets/iso/SLE-15-SP6-Full-aarch64-Build39.1-Media1.iso. What got cleaned up is the mounted repo https://openqa.suse.de/assets/repo/SLE-15-SP6-Full-aarch64-Build39.1-Media1/. I was wondering whether you could just mount the ISO to the folder again to help solve the problem. @livdywan
Updated by okurz 12 months ago
waynechen55 wrote in #note-20:
I hope there is a more robust, persistent solution than increasing the storage limit every time and praying for the next build.
Yes. A more robust solution would be to not rely on assets from OSD except for tests that run exclusively on OSD. openQA needs to clean up assets because we have to work with the space we have. We can't keep assets for a potential external system, as openQA does not know when the assets would no longer be needed. The best approach would be to sync assets over from IBS to any openQA instance that uses them.
waynechen55 wrote in #note-21:
By the way, the ISO media is still there: https://openqa.suse.de/assets/iso/SLE-15-SP6-Full-aarch64-Build39.1-Media1.iso. What got cleaned up is the mounted repo https://openqa.suse.de/assets/repo/SLE-15-SP6-Full-aarch64-Build39.1-Media1/. I was wondering whether you could just mount the ISO to the folder again to help solve the problem. @livdywan
That's not a "mounted repo", it's what was synced over from IBS or directly extracted from the ISO. If 39.1 is still the latest build published on IBS then you could re-execute the sync calls that are also visible on the openQA webUI. Compare to https://openqa.opensuse.org/admin/obs_rsync
Updated by waynechen55 12 months ago
To my understanding, cleanup should remove the oldest asset first instead of the latest one. But in this case, it cleaned up the latest Build39.1 instead of any older ones.
Updated by livdywan 12 months ago
- Priority changed from Urgent to Normal
waynechen55 wrote in #note-23:
To my understanding, cleanup should remove the oldest asset first instead of the latest one. But in this case, it cleaned up the latest Build39.1 instead of any older ones.
That can happen if older ones are still used by jobs.
I'm lowering the priority since we've discussed how openQA handles assets and it works as expected.
@waynechen55 Do you find that our docs on asset clean-up are missing anything we discussed here?
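The point that an older asset can outlive a newer one when it is still used by jobs can be illustrated with a simplified model. This is only a sketch under stated assumptions, not openQA's actual implementation: it assumes a single job group with one size limit and ranks assets by the id of the most recent job that uses them. The names `Asset` and `assets_to_remove`, and all sizes and job ids, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    size_gb: int
    last_job_id: int  # id of the most recent job in the group using this asset


def assets_to_remove(assets, limit_gb):
    """Return names of assets that do not fit into the group's size limit.

    Assets used by the most recent jobs are considered first, so an asset
    from an older build stays if a recent job still references it.
    """
    kept_gb = 0
    removed = []
    for asset in sorted(assets, key=lambda a: a.last_job_id, reverse=True):
        if kept_gb + asset.size_gb <= limit_gb:
            kept_gb += asset.size_gb
        else:
            removed.append(asset.name)
    return removed


if __name__ == "__main__":
    # Hypothetical numbers: the Build37.1 repo was used by a later job
    # than the Build39.1 repo, so it ranks higher.
    assets = [
        Asset("Build37.1-repo", 10, 300),
        Asset("Build39.1-repo", 10, 200),
        Asset("other-iso", 5, 250),
    ]
    print(assets_to_remove(assets, limit_gb=15))
```

In this model a job on another instance never updates `last_job_id`, so an asset that only external jobs depend on is among the first to be dropped once the limit is reached.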
Updated by livdywan 12 months ago
- Subject changed from [openQA][aarch64][media] 15-SP6 Build39.1 full media does not exist for aarch64 size:S to [openQA][aarch64][media] 15-SP6 Build39.1 media for aarch64 cleaned up while suse.asia instance was still using it size:S
- Status changed from In Progress to Feedback
I'm also clarifying the title, since the file was not missing; it was cleaned up.
Updated by waynechen55 12 months ago
We have a pull request https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18193 to enhance the test run.