action #164222
Client error messages look like http response codes - we should phrase them better - was: asset download fails with "521 Connect timeout" size:S
0%
Description
Observation
https://openqa.suse.de/tests/14969481#step/GRU/1 shows
preparation failed: Downloading "http://automotive-3.arch.suse.de/gitlab/chuller/carwos-chuller_carwos1.1-2850918.qcow2" failed with: Download of "/var/lib/openqa/share/factory/hdd/carwos-chuller_carwos1.1-2850918.qcow2" failed: 521 Connect timeout
and a lot of others in the same job group. Apparently this happened because chuller wants to "resurrect" the carwos tests and is struggling to get the communication up and running again. This can happen, but we should ensure that error messages can be read correctly. Here the "521 Connect timeout" looks like an HTTP response from "the server" but really seems to be an error generated by the client. Maybe Mojo requests produce errors that look similar to HTTP response codes. We should ensure these are properly labeled so that people cannot confuse them with responses from the server.
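To illustrate the kind of labeling meant here, the following is a minimal sketch in Python. It is not the openQA code (which is Perl), and the names `DownloadResult` and `describe_failure` as well as the exact wording of the messages are hypothetical. The idea is only that a number is shown when the server really answered, and that a client-side failure is described in words instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DownloadResult:
    """Outcome of a download attempt (hypothetical structure for illustration)."""
    url: str
    http_status: Optional[int] = None   # set only if the server actually answered
    client_error: Optional[str] = None  # set if the request never got a response


def describe_failure(result: DownloadResult) -> str:
    """Build an error message that says who reported the problem."""
    if result.http_status is not None:
        # The server answered, so the number is a genuine HTTP status code.
        return (f'Download of "{result.url}" failed: '
                f'server responded with HTTP status {result.http_status}')
    # No response was received, so there is no status code to show.
    reason = result.client_error or "unknown error"
    return (f'Download of "{result.url}" failed: '
            f'no response from server ({reason})')
```

With such a split, a connect timeout would be reported as "no response from server (Connect timeout)" and could no longer be mistaken for a status code sent by the server.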
Acceptance criteria
- AC1: Error messages clearly state to the users what the consequence is for them
Suggestions
- Find the code responsible for downloading this (is it test code? Minion code? Something else?) and improve its log message
Rollback steps
- DONE Remove silence for https://monitor.qa.suse.de/alerting/silence/1356ce3d-d7ca-4750-b478-710cb0366d85/edit?alertmanager=grafana (name: "Incomplete jobs (not restarted) of last 24h alert")
Updated by nicksinger 5 months ago
- Status changed from New to In Progress
- Assignee set to nicksinger
Updated by livdywan 5 months ago
It seems like this job has never worked? https://openqa.suse.de/tests/14960353#step/GRU/1 is the oldest one I see in Next/Previous and it has the same issue 🤔
Updated by nicksinger 5 months ago
- Assignee deleted (nicksinger)
- Priority changed from Urgent to Normal
Asked chuller about the new scheduled jobs: https://suse.slack.com/archives/C02CANHLANP/p1721384884116119 - it is expected that they are running. Apparently also just scheduling a single job is not possible.
Created a silence for now to mitigate the priority: https://monitor.qa.suse.de/alerting/silence/1356ce3d-d7ca-4750-b478-710cb0366d85/edit?alertmanager=grafana
This might turn out as an openQA feature request because I'm not certain if a job which fails to download necessary assets - but not as preparation but as test-module - should end up incomplete.
I'm unassigning myself now because I don't know how I could continue.
Updated by nicksinger 5 months ago
- Description updated (diff)
- Assignee set to nicksinger
Updated by openqa_review 5 months ago
- Due date set to 2024-08-03
Setting due date based on mean cycle time of SUSE QE Tools
Updated by nicksinger 5 months ago
- Subject changed from asset download fails with "521 Connect timeout" to Client error messages look like http response codes - we should phrase them better - was: asset download fails with "521 Connect timeout"
- Description updated (diff)
- Status changed from In Progress to New
- Assignee deleted (nicksinger)
Updated by mkittler 5 months ago · Edited
- Status changed from New to In Progress
- Assignee set to mkittler
I always found this bad. The reasoning for 521 is "# Used by cloudflare to indicate web server is down." (verbatim copy from our code), which refers to https://http.dev/521. However, we have nothing to do with Cloudflare, so it makes no sense.
By the way, the 521 code was introduced by https://github.com/os-autoinst/openQA/pull/1337 but without much reasoning.
Updated by mkittler 5 months ago
- Status changed from In Progress to Feedback
This PR will remove the made-up HTTP response codes: https://github.com/os-autoinst/openQA/pull/5810
Besides this, the error messages look good to me, and the HTTP server that was supposed to serve these assets was probably just not reachable at the time.
This might turn out as an openQA feature request because I'm not certain if a job which fails to download necessary assets - but not as preparation but as test-module - should end up incomplete.
This download actually happens as preparation and the error is only displayed as if it was produced by a test module. This error handling was implemented before the "reason" had been introduced. When introducing the "reason" I made it so that this specific error case would also show up as the reason, but I also kept the old way of inserting a fake test module. It would be easy to remove this fake test module and only rely on the "reason". So I can do this if that's wanted.
Updated by okurz 5 months ago
It would be easy to remove this fake test module and only rely on the "reason". So I can do this if that's wanted.
That would mean that the test details page is empty and we show the autoinst-log.txt and worker log content directly, right? I hope that this doesn't cause more confusion then.
Updated by okurz 5 months ago
- Subject changed from Client error messages look like http response codes - we should phrase them better - was: asset download fails with "521 Connect timeout" to Client error messages look like http response codes - we should phrase them better - was: asset download fails with "521 Connect timeout" size:S
- Description updated (diff)
- Status changed from Feedback to Resolved