action #163928

closed

[alert] Openqa HTTP Response lost on 15-07-24 size:S

Added by ybonatakis 5 months ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Regressions/Crashes
Start date: 2024-07-15
Due date:
% Done: 0%
Estimated time:

Description

Observation

https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=78&from=1720996087861&to=1720999161474
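
The `from` and `to` parameters in the Grafana link above are Unix timestamps in milliseconds. A minimal sketch for decoding them into readable times, so they can be compared with the log timestamps; the helper name `grafana_window` is hypothetical, and the UTC+2 offset in the usage note is an assumption (that the local times in this ticket are CEST):

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def grafana_window(url, tz=timezone.utc):
    """Return the (from, to) datetimes encoded in a Grafana dashboard URL."""
    query = parse_qs(urlparse(url).query)
    # Grafana stores absolute time ranges as epoch milliseconds.
    return tuple(
        datetime.fromtimestamp(int(query[key][0]) / 1000, tz=tz)
        for key in ("from", "to")
    )


url = (
    "https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary"
    "?orgId=1&viewPanel=78&from=1720996087861&to=1720999161474"
)
start, end = grafana_window(url)
print(start.isoformat(timespec="seconds"), end.isoformat(timespec="seconds"))
# prints: 2024-07-14T22:28:07+00:00 2024-07-14T23:19:21+00:00
```

Passing `tz=timezone(timedelta(hours=2))` would show the same window as 00:28 to 01:19 on 2024-07-15, again assuming the host logs in CEST.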

I took a look at the logs, which I attached to the ticket.

I can't spot the actual problem. The system seems to have performed an update, and it recovered after the services were restarted.
The unresponsiveness lasted from 00:42 to 01:05 (>20 min).

Looking at the logs, I see some errors from telegraf:

openqa telegraf[6820]: 2024-07-14T22:54:50Z E! [inputs.http] Error in plugin: [url=https://openqa.suse.de/admin/*]: Get "https://openqa.suse.de/admin/*": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

and many lines like these:

Jul 15 00:54:29 openqa openqa[12024]: [debug] [pid:12024] _carry_over_candidate(14928963): ignoring job 14855612 with repeated problem                                                                                                       
Jul 15 00:54:29 openqa openqa[12024]: [debug] [pid:12024] _carry_over_candidate(14928963): checking take over from 14834954: _failure_reason=GOOD 
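
To see which services log the most during the unresponsive window, the attached log could be filtered by time and counted per unit. A minimal sketch, assuming syslog-style lines like the two above, an English locale for month names, and the year 2024 (the lines themselves carry no year); the function name `count_units_in_window` is hypothetical:

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as
# "Jul 15 00:54:29 openqa openqa[12024]: [debug] ..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<unit>[^\[\s:]+)(?:\[\d+\])?: (?P<msg>.*)$"
)


def count_units_in_window(lines, start, end, year=2024):
    """Count log lines per unit whose timestamp lies in [start, end]."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.rstrip())
        if not m:
            continue  # skip lines in any other format
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        if start <= ts <= end:
            counts[m["unit"]] += 1
    return counts


sample = [
    "Jul 15 00:54:29 openqa openqa[12024]: [debug] [pid:12024] "
    "_carry_over_candidate(14928963): ignoring job 14855612 with repeated problem",
    "Jul 15 00:54:29 openqa openqa[12024]: [debug] [pid:12024] "
    "_carry_over_candidate(14928963): checking take over from 14834954: "
    "_failure_reason=GOOD",
]
window = (datetime(2024, 7, 15, 0, 42), datetime(2024, 7, 15, 1, 5))
print(count_units_in_window(sample, *window))
```

Lines that do not match the expected format are skipped rather than reported, so a real run should compare the counted total against the file's line count.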

Files

alert_tm0h5mf4k_full (3.63 MB): alert_tm0h5mf4k_full, truncated due to the upload size limit (ybonatakis, 2024-07-15 08:44)

Related issues 2 (0 open, 2 closed)

Related to openQA Infrastructure (public) - action #163775: Conduct "lessons learned" with Five Why analysis about many alerts, e.g. alerts not silenced for known issues size:S (Resolved, livdywan, 2024-07-10)
Is duplicate of openQA Infrastructure (public) - action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M (Resolved, okurz, 2024-07-10)
Actions #1

Updated by mkittler 5 months ago

  • Status changed from New to Rejected
  • Assignee set to mkittler

I guess it would have made more sense to add this information to #163592 instead of creating a new ticket. But thanks for your research; I added the information to #163592#note-34. I think we can close this ticket as a duplicate, though.

Actions #2

Updated by mkittler 5 months ago

  • Is duplicate of action #163592: [alert] (HTTP Response alert Salt tm0h5mf4k) size:M added
Actions #3

Updated by okurz 5 months ago

  • Status changed from Rejected to Feedback
  • Assignee changed from mkittler to okurz

I have some questions about this ticket which I would like to get answered with you but also with the help from others from the team:

  1. Why did you report this issue when we already have #163592?
  2. Why was the alert not disabled for #163592?
  3. "system seems to perform an update": Where did you see that?
  4. Why did you mention the _carry_over_candidate occurrences?
Actions #4

Updated by ybonatakis 5 months ago · Edited

okurz wrote in #note-3:

I have some questions about this ticket which I would like to get answered with you but also with the help from others from the team:

  1. Why did you report this issue when we already have #163592? I followed the recommendation from Liv on Slack, a discussion in which you participated as well.
  2. Why was the alert not disabled for #163592? I feel I can't answer that.
  3. "system seems to perform an update": Where did you see that? I think the logs show that many workers were updated. I also think there is a cron job which runs every night, no?
  4. Why did you mention the _carry_over_candidate occurrences? I don't know what _carry_over_candidate does exactly. It just appears a lot near the time frame.

Corrected quoting by okurz:
okurz wrote in #note-3:

I have some questions about this ticket which I would like to get answered with you but also with the help from others from the team:

  1. Why did you report this issue when we already have #163592?

I followed the recommendation from Liv on Slack, a discussion in which you participated as well.

  2. Why was the alert not disabled for #163592?

I feel I can't answer that.

  3. "system seems to perform an update": Where did you see that?

I think the logs show that many workers were updated. I also think there is a cron job which runs every night, no?

  4. Why did you mention the _carry_over_candidate occurrences?

I don't know what _carry_over_candidate does exactly. It just appears a lot near the time frame.

Actions #5

Updated by okurz 5 months ago

  • Subject changed from [alert] Openqa HTTP Response lost on 15-07-24 to [alert] Openqa HTTP Response lost on 15-07-24 size:S
  • Description updated (diff)
Actions #6

Updated by okurz 5 months ago

  • Description updated (diff)
Actions #7

Updated by okurz 5 months ago

ybonatakis, your quoting is broken, but I can answer. In Markdown quoting, make sure to separate each of your own lines from the quoted lines with a blank line in between.

  1. Why did you report this issue when we already have #163592? I followed the recommendation from Liv on Slack, a discussion in which you participated as well.

You mean https://suse.slack.com/archives/C02AJ1E568M/p1721029947196429, where you asked about the telegraf error and _carry_over_candidate? You did not provide any useful context and did not mention "HTTP response" or https://stats.openqa-monitor.qa.suse.de/d/WebuiDb/webui-summary?orgId=1&viewPanel=78&from=1720996087861&to=1720999161474
If you had mentioned that, we would certainly have suggested either not reporting a separate issue or creating the new ticket with the relation. Were you aware of the existence of #163592?

  2. Why was the alert not disabled for #163592? I feel I can't answer that.

Understood. The question goes more to others.

  3. "system seems to perform an update": Where did you see that? I think the logs show that many workers were updated. I also think there is a cron job which runs every night, no?

There are many cron jobs running at multiple times of the day.

  4. Why did you mention the _carry_over_candidate occurrences? I don't know what _carry_over_candidate does exactly. It just appears a lot near the time frame.

Yes, but it's not related. That function belongs to the comment carry-over, where a comment is carried over to a subsequent failed openQA test.
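
The quoting advice at the start of this note can be sketched as a small helper. This is only an illustration, assuming Markdown-style `> ` quote prefixes; the function name `quote_reply` is hypothetical and not part of any ticket tooling:

```python
def quote_reply(pairs):
    """Build a Markdown reply from (quoted_line, answer) pairs.

    Each quoted line gets a "> " prefix and is followed by a blank
    line, so the answer is not rendered as part of the quote.
    """
    blocks = []
    for quoted, answer in pairs:
        blocks.append(f"> {quoted}\n\n{answer}")
    # A blank line also separates one answer from the next quote.
    return "\n\n".join(blocks)


reply = quote_reply([
    ("why was the alert not disabled for #163592", "I feel I can't answer that"),
])
print(reply)
```

Without the blank line, Markdown treats the answer as a continuation of the quoted paragraph, which is the effect seen in note #4.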

Actions #8

Updated by okurz 5 months ago

Understood. The question goes more to others.

We will handle that part explicitly this afternoon.

@ybonatakis can you confirm you got responses?

Actions #9

Updated by livdywan 5 months ago

  • Related to action #163775: Conduct "lessons learned" with Five Why analysis about many alerts, e.g. alerts not silenced for known issues size:S added
Actions #10

Updated by okurz 5 months ago

  • Status changed from Feedback to Resolved

Clarified all open points with ybonatakis
