action #166772

open

coordination #102915: [saga][epic] Automated classification of failures

coordination #166655: [epic] openqa-label-known-issues

openqa-label-known-issues overrides size:S

Added by ybonatakis 2 months ago. Updated 8 days ago.

Status:
In Progress
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2024-09-13
Due date:
2024-11-27 (Due in 6 days)
% Done:

0%

Estimated time:

Description

Observation

https://github.com/os-autoinst/scripts/blob/master/openqa-label-known-issues#L55

if ! curl "${curl_args[@]}" -s "$testurl" -o "$out"; then

The problem is that the file named by $out is overwritten by this call. If the script then does not enter the if block, it continues with label_on_issues_from_issue_tracker on modified content, although that function is expected to work on the content of autoinst-log.txt.
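
One possible direction, sketched under the assumption that the wanted behavior is for $out to keep the content of autoinst-log.txt: send the response of the reachability check to a scratch file instead. The names fetch_url and check_reachable are hypothetical and not taken from the script; fetch_url only stands in for the curl call so the sketch runs without network access.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-in for the curl call, so the sketch needs no network.
fetch_url() {
    local url=$1 target=$2
    echo "body of $url" > "$target"
}

# Hypothetical variant of the check: the response goes to a scratch file,
# so the caller's file (expected to hold autoinst-log.txt) is left untouched.
check_reachable() {
    local testurl=$1
    local scratch rc=0
    scratch=$(mktemp)
    fetch_url "$testurl" "$scratch" || rc=$?
    rm -f "$scratch"
    return "$rc"
}

out=$(mktemp)
echo "autoinst-log.txt content" > "$out"
check_reachable "https://openqa.example/tests/1"
after=$(cat "$out")
echo "$after"
rm -f "$out"
```

Whether the response body is needed later is not clear from this ticket; if it is, the scratch file would have to be kept and passed on explicitly rather than deleted.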

Raised on https://github.com/os-autoinst/scripts/pull/342/files#r1745228802

Related #169699 showing

Nov 10 03:30:05 ariel openqa-gru[12854]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Nov 10 03:30:07 ariel openqa-gru[13280]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Nov 10 03:30:24 ariel openqa-gru[13985]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Nov 10 03:30:29 ariel openqa-gru[14252]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Nov 10 03:30:31 ariel openqa-gru[14391]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68
Nov 10 03:30:32 ariel openqa-gru[14597]: /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68

Acceptance criteria

  • AC1: The output file is written to the right location in the reference function.

Suggestions

  • Research what the use of the output file is and how to test/verify this
  • Try to make sense of the code to find out what the wanted behavior is
  • Add unit tests

Related issues 4 (2 open, 2 closed)

Related to openQA Project - action #165716: [o3] Munin - minion hook failed - /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68 size:M (Resolved, ybonatakis, 2024-08-23)

Related to openQA Infrastructure - action #169699: [alert] opensuse.org openqa.opensuse.org openqa_minion_jobs_hook_rc_failed minion 'hook failed - see openqa-gru service logs for details' (Rejected, okurz, 2024-11-11)

Related to openQA Infrastructure - action #166721: [alert] Waves of emails due to kex_exchange_identification: Connection closed by remote host errors (Feedback, livdywan)

Copied to openQA Project - action #169747: Multiple finalize_job_results and hook_script minion jobs per openQA job size:M (Workable, mkittler, 2024-09-13)

Actions #1

Updated by ybonatakis 2 months ago

  • Related to action #165716: [o3] Munin - minion hook failed - /opt/os-autoinst-scripts/openqa-label-known-issues: ERROR: line 68 size:M added
Actions #2

Updated by livdywan 2 months ago

Raised on https://github.com/os-autoinst/scripts/pull/342/files#r1745228802 but I think it is not an issue, as out in the function is a local variable.

What is the goal of this ticket? #166649 covers making the code legible so I wouldn't worry about it here.

Should this be "Complete unit test coverage for openqa-label-known-issues"? Or maybe "Consistent handling of old assets in openqa-label-known-issues"?

Actions #3

Updated by tinita 2 months ago

but I think it is not an issue, as out in the function is a local variable.

True, but the value of the variable is the filename passed by the caller. That filename stays the same, so the file itself is overwritten.
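
A minimal sketch of this point: local only scopes the variable name, not the file that its value points to. The function name overwrite_demo and the file contents are made up for illustration and are not taken from the script.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-in for the function under discussion: 'out' is local,
# but its value is still the path handed in by the caller.
overwrite_demo() {
    local out=$1
    echo "response body from another request" > "$out"
}

log=$(mktemp)
echo "content of autoinst-log.txt" > "$log"
overwrite_demo "$log"

# The caller's file now holds whatever the function wrote.
result=$(cat "$log")
echo "$result"
rm -f "$log"
```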

Actions #4

Updated by ybonatakis 2 months ago

livdywan wrote in #note-2:

Raised on https://github.com/os-autoinst/scripts/pull/342/files#r1745228802 but I think it is not an issue, as out in the function is a local variable.

What is the goal of this ticket? #166649 covers making the code legible so I wouldn't worry about it here.

Should this be "Complete unit test coverage for openqa-label-known-issues"? Or maybe "Consistent handling of old assets in openqa-label-known-issues"?

If the ticket needs more info for the estimation, give me some time to investigate and update it. Sounds good?

Actions #5

Updated by ybonatakis 2 months ago

  • Description updated (diff)

I verified that the autoinst logs are overwritten and updated the ticket. I also created a draft: https://github.com/os-autoinst/scripts/pull/347

Actions #6

Updated by tinita about 2 months ago

  • Parent task set to #166655
Actions #7

Updated by livdywan 28 days ago

  • Subject changed from openqa-label-known-issues overrides to [timedbox:10h] openqa-label-known-issues overrides size:S
  • Description updated (diff)
  • Status changed from New to Workable
Actions #8

Updated by livdywan 9 days ago

  • Description updated (diff)
Actions #9

Updated by okurz 9 days ago

  • Subject changed from [timedbox:10h] openqa-label-known-issues overrides size:S to openqa-label-known-issues overrides size:S
  • Description updated (diff)
  • Priority changed from Normal to High
  • Target version changed from Tools - Next to Ready

Adding to backlog due to #169699

Actions #10

Updated by okurz 9 days ago

  • Description updated (diff)
Actions #11

Updated by okurz 9 days ago

  • Related to action #169699: [alert] opensuse.org openqa.opensuse.org openqa_minion_jobs_hook_rc_failed minion 'hook failed - see openqa-gru service logs for details' added
Actions #12

Updated by ybonatakis 9 days ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #13

Updated by tinita 9 days ago · Edited

I hope I can help with some research.
I tried to find the minion jobs / openqa jobs for those failures in the logs.
When running openqa-label-known-issues on those jobs, the error cannot be reproduced.
But I found a pattern: all of the failing jobs are on investigate jobs.
(All of them are incompletes, but that is something we knew before already, as the error is in a function that's called when there is no autoinst log)
I couldn't figure out in what cases handle_unreachable would result in that error output.

Here is the list of jobs from yesterday and today:

openqa=> select id, concat('https://openqa.opensuse.org/tests/', args->1), task, started, state from minion_jobs where task = 'hook_script' and created >= '2024-11-10 11:39:00' and created <= '2024-11-12 11:42:00' and notes::varchar like '%hook_rc": 1%' order by started limit 100;
   id    |                  concat                   |    task     |            started            |  state   
---------+-------------------------------------------+-------------+-------------------------------+----------
 4541962 | https://openqa.opensuse.org/tests/4635024 | hook_script | 2024-11-11 11:39:08.434197+00 | finished
 4545150 | https://openqa.opensuse.org/tests/4637425 | hook_script | 2024-11-12 09:29:45.814464+00 | finished
 4545155 | https://openqa.opensuse.org/tests/4637425 | hook_script | 2024-11-12 09:30:05.762508+00 | finished
 4545157 | https://openqa.opensuse.org/tests/4637425 | hook_script | 2024-11-12 09:30:05.863404+00 | finished
 4545165 | https://openqa.opensuse.org/tests/4637439 | hook_script | 2024-11-12 09:30:26.215627+00 | finished
 4545173 | https://openqa.opensuse.org/tests/4637439 | hook_script | 2024-11-12 09:30:45.886272+00 | finished
 4545175 | https://openqa.opensuse.org/tests/4637439 | hook_script | 2024-11-12 09:30:50.763358+00 | finished
 4545183 | https://openqa.opensuse.org/tests/4637440 | hook_script | 2024-11-12 09:31:12.157622+00 | finished
 4545190 | https://openqa.opensuse.org/tests/4637440 | hook_script | 2024-11-12 09:31:25.725176+00 | finished
 4545193 | https://openqa.opensuse.org/tests/4637440 | hook_script | 2024-11-12 09:31:25.903787+00 | finished
 4545204 | https://openqa.opensuse.org/tests/4637442 | hook_script | 2024-11-12 09:32:16.65011+00  | finished
 4545209 | https://openqa.opensuse.org/tests/4637441 | hook_script | 2024-11-12 09:32:32.69126+00  | finished
 4545213 | https://openqa.opensuse.org/tests/4637442 | hook_script | 2024-11-12 09:32:43.312287+00 | finished
 4545214 | https://openqa.opensuse.org/tests/4637442 | hook_script | 2024-11-12 09:32:43.369843+00 | finished
 4545216 | https://openqa.opensuse.org/tests/4637441 | hook_script | 2024-11-12 09:32:52.946281+00 | finished
 4545218 | https://openqa.opensuse.org/tests/4637441 | hook_script | 2024-11-12 09:32:53.106781+00 | finished
 4545224 | https://openqa.opensuse.org/tests/4637455 | hook_script | 2024-11-12 09:33:18.878535+00 | finished
 4545229 | https://openqa.opensuse.org/tests/4637459 | hook_script | 2024-11-12 09:33:33.341963+00 | finished
 4545233 | https://openqa.opensuse.org/tests/4637455 | hook_script | 2024-11-12 09:33:48.173654+00 | finished
 4545235 | https://openqa.opensuse.org/tests/4637455 | hook_script | 2024-11-12 09:33:48.259264+00 | finished
 4545237 | https://openqa.opensuse.org/tests/4637459 | hook_script | 2024-11-12 09:33:57.955449+00 | finished
 4545239 | https://openqa.opensuse.org/tests/4637459 | hook_script | 2024-11-12 09:33:58.084309+00 | finished
 4545257 | https://openqa.opensuse.org/tests/4637465 | hook_script | 2024-11-12 09:35:37.193929+00 | finished
 4545263 | https://openqa.opensuse.org/tests/4637465 | hook_script | 2024-11-12 09:36:11.984583+00 | finished
 4545265 | https://openqa.opensuse.org/tests/4637465 | hook_script | 2024-11-12 09:36:12.078513+00 | finished
 4545267 | https://openqa.opensuse.org/tests/4637475 | hook_script | 2024-11-12 09:36:17.567124+00 | finished
 4545273 | https://openqa.opensuse.org/tests/4637475 | hook_script | 2024-11-12 09:36:51.89106+00  | finished
 4545275 | https://openqa.opensuse.org/tests/4637475 | hook_script | 2024-11-12 09:36:51.983071+00 | finished
 4545277 | https://openqa.opensuse.org/tests/4637481 | hook_script | 2024-11-12 09:36:57.903039+00 | finished
 4545284 | https://openqa.opensuse.org/tests/4637481 | hook_script | 2024-11-12 09:37:31.831233+00 | finished
 4545286 | https://openqa.opensuse.org/tests/4637481 | hook_script | 2024-11-12 09:37:31.940482+00 | finished
 4545290 | https://openqa.opensuse.org/tests/4637471 | hook_script | 2024-11-12 09:39:14.894448+00 | finished
 4545294 | https://openqa.opensuse.org/tests/4637471 | hook_script | 2024-11-12 09:39:48.528856+00 | finished
 4545296 | https://openqa.opensuse.org/tests/4637471 | hook_script | 2024-11-12 09:39:48.607363+00 | finished
 4545298 | https://openqa.opensuse.org/tests/4637485 | hook_script | 2024-11-12 09:39:55.220461+00 | finished
 4545300 | https://openqa.opensuse.org/tests/4637485 | hook_script | 2024-11-12 09:40:29.07404+00  | finished
 4545302 | https://openqa.opensuse.org/tests/4637485 | hook_script | 2024-11-12 09:40:29.14412+00  | finished
 4545304 | https://openqa.opensuse.org/tests/4637486 | hook_script | 2024-11-12 09:40:35.520763+00 | finished
 4545309 | https://openqa.opensuse.org/tests/4637486 | hook_script | 2024-11-12 09:41:09.042989+00 | finished
 4545311 | https://openqa.opensuse.org/tests/4637486 | hook_script | 2024-11-12 09:41:09.121461+00 | finished
 4545313 | https://openqa.opensuse.org/tests/4637487 | hook_script | 2024-11-12 09:41:15.832463+00 | finished
 4545317 | https://openqa.opensuse.org/tests/4637487 | hook_script | 2024-11-12 09:41:43.402739+00 | finished
 4545496 | https://openqa.opensuse.org/tests/4637546 | hook_script | 2024-11-12 11:21:59.522885+00 | finished
 4545498 | https://openqa.opensuse.org/tests/4637552 | hook_script | 2024-11-12 11:22:00.62644+00  | finished
 4545500 | https://openqa.opensuse.org/tests/4637553 | hook_script | 2024-11-12 11:22:02.797964+00 | finished
 4545502 | https://openqa.opensuse.org/tests/4637551 | hook_script | 2024-11-12 11:22:04.694155+00 | finished
 4545535 | https://openqa.opensuse.org/tests/4637558 | hook_script | 2024-11-12 11:41:06.887807+00 | finished
 4545536 | https://openqa.opensuse.org/tests/4637557 | hook_script | 2024-11-12 11:41:06.970634+00 | finished
 4545539 | https://openqa.opensuse.org/tests/4637555 | hook_script | 2024-11-12 11:41:12.210091+00 | finished
 4545325 | https://openqa.opensuse.org/tests/4637359 | hook_script | 2024-11-12 12:28:58.202091+00 | inactive
(50 rows)

I suggest adding some helpful debug output inside the handle_unreachable function. Those cases do not happen that often, so IMHO it's ok to have some more information in the log.
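
Such debug output could look roughly like the sketch below. It assumes that handle_unreachable receives the test URL and the output file as arguments, which this ticket does not confirm; debug and handle_unreachable_sketch are hypothetical names, and the real function does more than this.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: writes to stderr so the message ends up in the
# service log without mixing into the function's regular output.
debug() {
    echo "DEBUG: $*" >&2
}

# Hypothetical stand-in that only shows where the debug calls could go.
handle_unreachable_sketch() {
    local testurl=$1 out=$2
    debug "handle_unreachable testurl=$testurl out=$out"
    if [ -e "$out" ]; then
        debug "out exists, size=$(wc -c < "$out") bytes"
    else
        debug "out does not exist"
    fi
}

tmp=$(mktemp)
printf 'abc' > "$tmp"
messages=$(handle_unreachable_sketch "https://openqa.example/tests/1" "$tmp" 2>&1)
echo "$messages"
rm -f "$tmp"
```

Logging the arguments and the state of the output file on entry would at least show which input the function saw when the error occurs.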

edit: Also noteworthy, there is sometimes more than one minion job for the same openqa job.

Actions #14

Updated by ybonatakis 9 days ago

Thanks Tina. I couldn't reproduce it either. I tried both scenarios (with and without the fix for the $out override issue) and the result is the same.

Actions #16

Updated by tinita 9 days ago

  • Copied to action #169747: Multiple finalize_job_results and hook_script minion jobs per openQA job size:M added
Actions #17

Updated by openqa_review 9 days ago

  • Due date set to 2024-11-27

Setting due date based on mean cycle time of SUSE QE Tools

Actions #18

Updated by ybonatakis 8 days ago

I am so confused

❯ dry_run=1 ./openqa-label-known-issues-and-investigate-hook 4639446
openqa-cli (318 /home/iob/Documents/Work/qatools/repos/scripts/_common): Error making API request (jobs/http://openqa.opensuse.org/t4638817): 404 Not Found
{"error_status":404}
Skipping posting investigation comment on original job http://openqa.opensuse.org/t4638817 as it does not exist anymore

But https://openqa.opensuse.org/tests/4638817 does exist

openqa-gru on o3 shows constant errors and keeps producing emails.

Actions #19

Updated by ybonatakis 8 days ago

I have updated https://github.com/os-autoinst/scripts/pull/347 and added some debug output in different scripts.

Actions #20

Updated by okurz 8 days ago

  • Priority changed from High to Urgent

We still see frequent emails because the error keeps recurring, which makes this ticket urgent. ybonatakis, please focus on urgency mitigation as a first step.

Actions #21

Updated by tinita 8 days ago

I temporarily disabled /etc/cron.d/os-autoinst-scripts-update-git on o3 and added set -x to handle_unreachable, because I have no clue where the error actually happens.
Set the notification email to mine in /etc/munin/munin.conf for now.
Waiting until we see this again.

Actions #22

Updated by tinita 8 days ago

ybonatakis wrote in #note-18:

I am so confused

❯ dry_run=1 ./openqa-label-known-issues-and-investigate-hook 4639446

Calling openqa-label-known-issues-and-investigate-hook here is not really helpful, I think. The error happens in openqa-label-known-issues. Calling the whole hook script will call openqa-investigate first, and that does different things if it has already processed the job.
Just calling ./openqa-label-known-issues https://openqa.opensuse.org/tests/4637359 should be enough, but the error is not reproducible, so it could be a temporary network thing.

Actions #23

Updated by tinita 8 days ago

https://github.com/os-autoinst/scripts/pull/342 was never deployed on o3, due to https://progress.opensuse.org/issues/166721#note-15
This explains why we still saw the error in line 68.

The repo is now updated on o3, and the actual bugfix regarding the output file can be worked on. I will enable the cronjob and email notification again.

Actions #24

Updated by tinita 8 days ago

  • Priority changed from Urgent to High
Actions #25

Updated by tinita 8 days ago · Edited

tinita wrote in #note-21:

I temporarily disabled /etc/cron.d/os-autoinst-scripts-update-git on o3 and added set -x to handle_unreachable, because I have no clue where the error actually happens.
Set the notification email to mine in /etc/munin/munin.conf for now.
Waiting until we see this again.

Enabled cronjob, enabled email notification again.
@ybonatakis can continue tomorrow.

Actions #26

Updated by tinita 6 days ago

  • Related to action #166721: [alert] Waves of emails due to kex_exchange_identification: Connection closed by remote host errors added