action #96010

closed

[qem] test fails in hawk_gui acquiring a lock as the support server ended prematurely after a '503 response: Service Unavailable; URL was http://openqa.suse.de/api/v1/mm/children'

Added by martinsmac over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2021-07-26
Due date:
% Done:
0%

Estimated time:

Description

Observation

openQA test in scenario sle-15-SP1-Server-DVD-HA-Incidents-x86_64-qam_ha_hawk_client@64bit fails in hawk_gui

Test suite description

The base test suite is used for job templates defined in YAML documents. It has no settings of its own.

Reproducible

Fails since (at least) Build :20208:fence-agents

Expected result

Last good: :20487:novnc (or more recent)

Further details

Always latest result in this scenario: latest

The test is failing at several steps; the one referenced in this ticket is the most common. More research is needed to determine what the problem is.


Related issues: 1 (0 open, 1 closed)

Copied to openQA Project - action #98940: mmapi calls can still fail despite retries (Resolved, assignee: okurz)

Actions #1

Updated by okurz over 2 years ago

  • Project changed from openQA Tests to openQA Project
  • Subject changed from [qem] test fails in hawk_gui to [qem] test fails in hawk_gui acquiring a lock as the support server ended prematurely after a '503 response: Service Unavailable; URL was http://openqa.suse.de/api/v1/mm/children'
  • Category changed from Bugs in existing tests to Regressions/Crashes
  • Status changed from New to In Progress
  • Assignee set to okurz
  • Priority changed from Normal to High
  • Target version set to Ready

Not seen before, but here is what I see:

[2021-07-26T12:47:42.719 CEST] [debug] Waiting for 3 jobs to finish
[2021-07-26T12:47:43.734 CEST] [debug] get_children: 503 response: Service Unavailable; URL was http://openqa.suse.de/api/v1/mm/children
[2021-07-26T12:47:43.734 CEST] [debug] Waiting for 0 jobs to finish

So it looks like the worker failed to reach osd and treated that as not needing to wait any longer. I will take a deeper look.
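
To illustrate the failure mode (a minimal sketch only, not the code of the eventual fix): a mm API helper should retry on a 503 and fail loudly once the retries are exhausted, instead of returning an empty child list. The helper name, the omitted API-key handling and the assumed JSON shape are illustrative assumptions.

# Minimal sketch, NOT the actual os-autoinst/openQA implementation.
use strict;
use warnings;
use Mojo::UserAgent;

sub get_children_with_retries {    # hypothetical helper name
    my ($base_url, $retries) = @_;
    $retries //= 3;
    my $ua = Mojo::UserAgent->new;
    for my $attempt (1 .. $retries) {
        my $res = $ua->get("$base_url/api/v1/mm/children")->result;
        # assumption: the endpoint answers with a JSON object holding a "jobs" key
        return $res->json->{jobs} if $res->is_success;
        # log and retry instead of treating the failure as "no children to wait for"
        warn sprintf "get_children: %s response: %s; retries left: %d\n",
            $res->code, $res->message, $retries - $attempt;
        sleep 3;
    }
    die "get_children still failing after $retries attempts\n";
}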

Actions #2

Updated by okurz over 2 years ago

  • Due date set to 2021-08-09
  • Status changed from In Progress to Feedback
Actions #3

Updated by okurz over 2 years ago

https://github.com/os-autoinst/os-autoinst/pull/1730 is merged; we should wait some days to see if there is any effect.

Actions #4

Updated by okurz over 2 years ago

In the above PR I added a fix, better logging output, and more test coverage. What could we do about monitoring or our processes?

Actions #5

Updated by okurz over 2 years ago

  • Status changed from Feedback to Resolved

No problems identified after deployment. Suggested follow-up improvement: #96191

Actions #6

Updated by dzedro over 2 years ago

  • Status changed from Resolved to Feedback

These failures are happening lately every day.

[2021-09-17T08:06:54.202 CEST] [debug] api_call_2 failed, retries left: 2 of 3
[2021-09-17T08:06:57.208 CEST] [debug] api_call_2 failed, retries left: 1 of 3
[2021-09-17T08:07:00.213 CEST] [debug] api_call_2 failed, retries left: 0 of 3
[2021-09-17T08:07:03.214 CEST] [debug] get_children: 503 response: Service Unavailable; URL was http://openqa.suse.de/api/v1/mm/children

https://openqa.suse.de/tests/7146596
https://openqa.suse.de/tests/7146593
https://openqa.suse.de/tests/7146604
https://openqa.suse.de/tests/7146614
https://openqa.suse.de/tests/7146646

Actions #7

Updated by livdywan over 2 years ago

Looks the same indeed. Though I'm surprised to see this after 2 months - did this go unnoticed or did it only start happening again? 🤔️

Actions #8

Updated by dzedro over 2 years ago

cdywan wrote:

Looks the same indeed. Though I'm surprised to see this after 2 months - did this go unnoticed or did it only start happening again? 🤔️

I would say it started to happen again; I did not see this kind of failure in the last 2 months.
Today there was only one failure on aggregates: https://openqa.suse.de/tests/7170466

Actions #9

Updated by okurz over 2 years ago

  • Copied to action #98940: mmapi calls can still fail despite retries added
Actions #10

Updated by okurz over 2 years ago

  • Due date deleted (2021-08-09)
  • Status changed from Feedback to Resolved

I will try with longer and more retries in #98940
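
As a rough illustration of what "longer and more retries" could mean (a sketch only, not the change made in #98940; the names and defaults are made up), a generic wrapper with exponential backoff would outlast a short 503 window much better than three retries a few seconds apart:

# Illustrative sketch only; not taken from openQA or #98940.
use strict;
use warnings;

sub call_with_backoff {
    my ($call, %opt) = @_;
    my $tries = $opt{tries} // 5;
    my $delay = $opt{delay} // 3;
    my $err;
    for my $attempt (1 .. $tries) {
        my $result = eval { $call->() };
        return $result unless $@;
        $err = $@;
        warn "attempt $attempt/$tries failed: $err";
        sleep $delay if $attempt < $tries;
        $delay *= 2;    # 3 s, 6 s, 12 s, 24 s ...
    }
    die "all $tries attempts failed, last error: $err";
}

# usage example: call_with_backoff(sub { get_children() }, tries => 5, delay => 3);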
