action #164222

closed

Client error messages look like http response codes - we should phrase them better - was: asset download fails with "521 Connect timeout" size:S

Added by okurz 5 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Feature requests
Start date: 2024-07-19
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://openqa.suse.de/tests/14969481#step/GRU/1 shows

preparation failed: Downloading "http://automotive-3.arch.suse.de/gitlab/chuller/carwos-chuller_carwos1.1-2850918.qcow2" failed with: Download of "/var/lib/openqa/share/factory/hdd/carwos-chuller_carwos1.1-2850918.qcow2" failed: 521 Connect timeout 

and a lot of others in the same job group. Apparently this happened because chuller wants to "resurrect" the carwos tests and struggles to get the communication up and running again. This can happen, but we should ensure that error messages can be read correctly. Here the "521 Connect timeout" looks like an HTTP response from "the server" but really seems to be produced by the client. Maybe Mojo requests produce errors that look similar to HTTP response codes. We should ensure these are properly labeled so that people cannot confuse them with responses from the server.
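To illustrate the kind of labeling meant here, the following is a minimal sketch in Python. It is not openQA's actual code (which is written in Perl). The `DownloadResult` structure, the `describe_failure` function and the example URL are hypothetical names made up for this illustration. The idea is that a client-side failure never prints anything resembling a status code, while a real server response is explicitly attributed to the server.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DownloadResult:
    """Outcome of one download attempt (hypothetical structure for illustration)."""
    url: str
    http_status: Optional[int] = None    # set only if the server actually answered
    http_reason: Optional[str] = None    # e.g. "Not Found"
    client_error: Optional[str] = None   # e.g. "Connect timeout"; set if no response arrived


def describe_failure(result: DownloadResult) -> str:
    """Build a message that states where the error came from."""
    prefix = f'Download of "{result.url}"'
    if result.client_error is not None:
        # No HTTP response exists, so avoid anything that looks like a status code.
        return (f"{prefix} failed: no response from server "
                f"(client-side error: {result.client_error})")
    if result.http_status is not None and result.http_status >= 400:
        reason = f" {result.http_reason}" if result.http_reason else ""
        return f"{prefix} failed: server responded with HTTP {result.http_status}{reason}"
    return f"{prefix} succeeded"


if __name__ == "__main__":
    # Hypothetical example URL; ".invalid" is a reserved, non-resolving TLD.
    timeout = DownloadResult(url="http://example.invalid/a.qcow2",
                             client_error="Connect timeout")
    print(describe_failure(timeout))
```

With a message shaped like this, a reader can tell at a glance whether the asset server answered at all, which is the distinction the "521 Connect timeout" text blurs.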

Acceptance criteria

  • AC1: Error messages clearly state to the users what the consequence is for them

Suggestions

  • Find the code responsible for downloading this (is it test code? Minion code? Something else?) and improve its log message

Rollback steps

Actions #1

Updated by livdywan 5 months ago

Might be related to #163766? Although this is 521 Connect timeout versus Download error 598.

Actions #2

Updated by nicksinger 5 months ago

  • Status changed from New to In Progress
  • Assignee set to nicksinger

Actions #3

Updated by livdywan 5 months ago

It seems like this job has never worked? https://openqa.suse.de/tests/14960353#step/GRU/1 is the oldest one I see in Next/Previous and it has the same issue 🤔

Actions #4

Updated by nicksinger 5 months ago

  • Assignee deleted (nicksinger)
  • Priority changed from Urgent to Normal

Asked chuller about the new scheduled jobs: https://suse.slack.com/archives/C02CANHLANP/p1721384884116119 - it is expected that they are running. Apparently also just scheduling a single job is not possible.
Created a silence for now to mitigate the priority: https://monitor.qa.suse.de/alerting/silence/1356ce3d-d7ca-4750-b478-710cb0366d85/edit?alertmanager=grafana

This might turn out to be an openQA feature request because I'm not certain if a job which fails to download necessary assets - but not as preparation but as test-module - should end up incomplete.
I am unassigning myself now because I don't know how I could continue.

Actions #5

Updated by nicksinger 5 months ago

  • Description updated (diff)
  • Assignee set to nicksinger

Actions #6

Updated by openqa_review 5 months ago

  • Due date set to 2024-08-03

Setting due date based on mean cycle time of SUSE QE Tools

Actions #7

Updated by nicksinger 5 months ago

  • Subject changed from asset download fails with "521 Connect timeout" to Client error messages look like http response codes - we should phrase them better - was: asset download fails with "521 Connect timeout"
  • Description updated (diff)
  • Status changed from In Progress to New
  • Assignee deleted (nicksinger)

Actions #8

Updated by okurz 5 months ago

  • Tags changed from reactive work, infra to reactive work
  • Description updated (diff)
  • Due date deleted (2024-08-03)
  • Category changed from Regressions/Crashes to Feature requests

The silence was already removed. Marked the corresponding rollback step as done.

Actions #9

Updated by mkittler 5 months ago · Edited

  • Status changed from New to In Progress
  • Assignee set to mkittler

I always found this bad. The reasoning for 521 is "# Used by cloudflare to indicate web server is down." (verbatim copy from our code), which refers to https://http.dev/521. However, we have nothing to do with Cloudflare, so it makes no sense.

By the way, the 521 code was introduced by https://github.com/os-autoinst/openQA/pull/1337 but without much reasoning.

Actions #10

Updated by mkittler 5 months ago

  • Status changed from In Progress to Feedback

This PR will remove the made-up HTTP response codes: https://github.com/os-autoinst/openQA/pull/5810

Besides this, the error messages look good to me, and the HTTP server that was supposed to serve these assets was probably just not reachable at the time.


> This might turn out as a openQA feature request because I'm not certain if a job which fails to download necessary assets - but not as preparation but as test-module - should end up incomplete.

This download actually happens as preparation and the error is only displayed as if it was produced by a test module. This error handling was implemented when the "reason" had not been introduced yet. When introducing the "reason" I made it so that this specific error case would also show up as reason but also kept the old way of inserting a fake test module. It would be easy to remove this fake test module and only rely on the "reason". So I can do this if that's wanted.

Actions #11

Updated by okurz 5 months ago

> It would be easy to remove this fake test module and only rely on the "reason". So I can do this if that's wanted.

That would mean that the test details page is empty and we show the autoinst-log.txt and worker log content directly, right? I hope that this doesn't cause more confusion then.

Actions #12

Updated by okurz 5 months ago

  • Subject changed from Client error messages look like http response codes - we should phrase them better - was: asset download fails with "521 Connect timeout" to Client error messages look like http response codes - we should phrase them better - was: asset download fails with "521 Connect timeout" size:S
  • Description updated (diff)
  • Status changed from Feedback to Resolved