action #179545

closed

coordination #154768: [saga][epic][ux] State-of-art user experience for openQA

coordination #179572: [epic] Improved test reviewer user experience - job dependencies and status

Skipped dependencies with START_DIRECTLY_AFTER_TEST size:M

Added by pcervinka about 1 month ago. Updated 33 minutes ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2025-03-27
Due date:
% Done: 0%
Estimated time:

Description

Observation

We noticed that jobs using START_DIRECTLY_AFTER_TEST are skipped.

Here are a few examples of the behavior. We can see that the system was installed and the installation job finished fine, but all jobs depending on it are skipped.

x86_64: https://openqa.suse.de/tests/17170472#dependencies
aarch64: https://openqa.suse.de/tests/17169747#dependencies, https://openqa.suse.de/tests/17106428#dependencies
PowerVM example: https://openqa.suse.de/tests/17138135#dependencies

  • It started roughly one week ago
  • Restarting the installation job doesn't help much
  • It doesn't matter whether the machine is in the cc zone or outside of it
  • There were no related job group configuration changes

Unfortunately, this behavior blocks baremetal testing.

Steps to reproduce

Acceptance Criteria

Suggestions

  • Consider the worker load threshold. Exceeding it shouldn't make jobs end up as skipped
  • Directly chained jobs should just wait until the load has settled
  • As an alternative, distribute the worker instances away from grenache, which is prone to reporting too high a load due to how KVM@PowerNV works
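
The first two suggestions amount to waiting for the load to settle instead of skipping the chained job. A minimal Python sketch of that idea follows. The function name `wait_until_load_settles`, its parameters and the polling approach are assumptions made for illustration; this is not the actual openQA worker code.

```python
import os
import time
from typing import Callable


def wait_until_load_settles(
    threshold: float,
    read_load: Callable[[], float] = lambda: os.getloadavg()[0],
    poll_interval: float = 10.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the load until it is at or below the threshold or polls run out.

    Returns True if the load settled (the chained job may start) and False
    if it stayed too high for all polls (the caller decides what to do).
    All names here are hypothetical; this is not the openQA worker API.
    """
    for _ in range(max_polls):
        if read_load() <= threshold:
            return True
        sleep(poll_interval)
    return False


# Made-up load samples that settle on the third poll; 25 is the threshold
# reported in the worker log quoted in this ticket.
samples = iter([32.0, 27.0, 20.0])
settled = wait_until_load_settles(
    threshold=25,
    read_load=lambda: next(samples),
    sleep=lambda seconds: None,  # do not actually sleep in the demo
)
print(settled)  # True
```

Returning a boolean rather than raising leaves the decision about a persistently high load to the caller, which could then report an explicit reason instead of silently skipping.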

Related issues 1 (1 open, 0 closed)

Copied to openQA Project (public) - action #179563: Provide a reason when tests end up in the skipped state (New, 2025-03-27)

Actions #1

Updated by pcervinka about 1 month ago

  • Description updated (diff)
  • Priority changed from Normal to High
Actions #2

Updated by ybonatakis about 1 month ago

A couple of errors in the minion job https://openqa.suse.de/minion/jobs?id=15036989 (if this is the one)

Actions #3

Updated by okurz about 1 month ago

  • Tags set to reactive work
  • Project changed from openQA Tests (public) to openQA Project (public)
  • Category changed from Bugs in existing tests to Support
  • Status changed from New to In Progress
  • Assignee set to okurz
  • Target version set to Ready

ybonatakis wrote in #note-2:

A couple of errors in the minion job https://openqa.suse.de/minion/jobs?id=15036989 (if this is the one)

https://openqa.suse.de/minion/jobs?id=15036989 does show errors like "START_AFTER_TEST=gnome@64bit not found - check for dependency typos and dependency cycles", but that means the corresponding dependencies are not created at all. In the aforementioned openQA jobs the dependencies are there; the jobs are merely skipped.

From the log files when looking into the parent https://openqa.suse.de/tests/17170252#dependencies and one child https://openqa.suse.de/tests/17170469 I see

openqa:/var/log # grep '\(17170252\|17170469\)' openqa_scheduler openqa_gru
openqa_scheduler:[2025-03-27T05:27:46.366179Z] [debug] [pid:19934] Need to schedule 1 parallel jobs for job 17170252 (with priority 50)
openqa_scheduler:[2025-03-27T05:27:46.440076Z] [debug] [pid:19934] [Job#17170252] Prepare for being processed by worker 4033
openqa_scheduler:[2025-03-27T05:27:46.575138Z] [debug] [pid:19934] [Job#17170469] Prepare for being processed by worker 4033
openqa_scheduler:[2025-03-27T05:27:47.240665Z] [debug] [pid:19934] Sent job(s) '17170469, 17170474, 17170471, 17170470, 17170473, 17170252, 17170472' to worker '4033'
openqa_scheduler:[2025-03-27T05:27:50.222853Z] [debug] [pid:19934] Allocated: { job => 17170469, worker => 4033 }
openqa_scheduler:[2025-03-27T05:27:50.223322Z] [debug] [pid:19934] Allocated: { job => 17170252, worker => 4033 }

That all looks OK.

Actions #4

Updated by okurz about 1 month ago

https://openqa.suse.de/admin/auditlog does not have any hit for 17170469

But I assume ybonatakis is on the right track. https://openqa.suse.de/minion/jobs?id=15036989 says

  - error_messages:
    - START_DIRECTLY_AFTER_TEST=ay_prepare_baremetal@ipmi-kernel-rt not found - check
      for dependency typos and dependency cycles
    job_id: 17170469

and ay_prepare_baremetal is not there. https://openqa.suse.de/admin/productlog?id=2737223 also shows those error messages.

So there might be a problem due to "START_DIRECTLY_AFTER_TEST=ay_prepare_baremetal@ipmi-kernel-rt not found - check for dependency typos and dependency cycles". Do you know about ay_prepare_baremetal? According to https://openqa.suse.de/tests?match=ay_prepare_baremetal the last successful run on another machine was 2025-02-28 but no record of ay_prepare_baremetal@ipmi-kernel-rt at all. There is no recent change that I know of in the scheduling algorithms
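
The "not found" error can be pictured as a lookup of each START_DIRECTLY_AFTER_TEST entry among the jobs of the same scheduled product. The Python sketch below is only an illustration under assumptions: the helper `find_missing_parents`, the tuple representation of jobs, the example schedule, and the rule that an entry without `@machine` refers to the child's own machine are hypothetical simplifications, not openQA's actual scheduling code.

```python
from typing import List, Set, Tuple


def find_missing_parents(
    setting: str, child_machine: str, scheduled: Set[Tuple[str, str]]
) -> List[str]:
    """Return the entries of a comma-separated dependency setting that have
    no matching (test, machine) pair among the scheduled jobs."""
    missing = []
    for entry in filter(None, (e.strip() for e in setting.split(","))):
        test, _, machine = entry.partition("@")
        # Assumption: without an explicit "@machine" the child's machine is used.
        machine = machine or child_machine
        if (test, machine) not in scheduled:
            missing.append(f"{test}@{machine}")
    return missing


# Hypothetical schedule: only prepare_baremetal exists on this machine.
scheduled_jobs = {("prepare_baremetal", "ipmi-kernel-rt")}
missing = find_missing_parents(
    "prepare_baremetal,ay_prepare_baremetal", "ipmi-kernel-rt", scheduled_jobs
)
print(missing)  # ['ay_prepare_baremetal@ipmi-kernel-rt']
```

In this picture a missing entry only means that one dependency edge is not created; it does not by itself explain why jobs with existing dependencies end up skipped.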

Actions #5

Updated by pcervinka about 1 month ago

There were no changes in the related setup; we use START_DIRECTLY_AFTER_TEST=prepare_baremetal,ay_prepare_baremetal all the time and it worked fine.

Moreover, the skipped kdump test https://openqa.suse.de/tests/17167435#dependencies has only START_DIRECTLY_AFTER_TEST=prepare_baremetal.

Actions #6

Updated by pcervinka about 1 month ago

okurz wrote in #note-4:

So there might be a problem due to "START_DIRECTLY_AFTER_TEST=ay_prepare_baremetal@ipmi-kernel-rt not found - check for dependency typos and dependency cycles". Do you know about ay_prepare_baremetal? According to https://openqa.suse.de/tests?match=ay_prepare_baremetal the last successful run on another machine was 2025-02-28 but no record of ay_prepare_baremetal@ipmi-kernel-rt at all. There is no recent change that I know of in the scheduling algorithms

ay_prepare_baremetal is used only for QR validation, which is not scheduled often; the last build was about a month ago.

Actions #7

Updated by pcervinka about 1 month ago

The same issue also occurs on the unarmed aarch64 machine https://openqa.suse.de/tests/17169415#dependencies

System installation and LTP installation were done and the rest was skipped.

Also see the PowerVM installation on micro: https://openqa.suse.de/tests/17138135#dependencies, though after a job restart it passed: https://openqa.suse.de/tests/17162132#dependencies.

Actions #8

Updated by mkittler about 1 month ago

If you like you can assign this ticket to me. I found the "problem":

Mar 27 03:17:29 grenache-1 worker[437759]: [debug] [pid:437759] Job 17167439 from openqa.suse.de finished - reason: skipped
Mar 27 03:17:29 grenache-1 worker[437759]: [debug] [pid:437759] Cleaning up for next job
Mar 27 03:17:29 grenache-1 worker[437759]: [warn] [pid:437759] The average load (32.32 27.97 17.46) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is low>
Mar 27 03:17:29 grenache-1 worker[437759]: [info] [pid:437759] Skipping job 17167435 from queue because worker is broken (The average load (32.32 27.97 17.46) is exceeding the configured threshold of 25. The worker>
Mar 27 03:17:29 grenache-1 worker[437759]: [debug] [pid:437759] Stopping job 17167435 from openqa.suse.de: ? - reason: skipped
Mar 27 03:17:29 grenache-1 worker[437759]: [debug] [pid:437759] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/17167435/set_done?result=skipped&worker_id=4033"
Mar 27 03:17:29 grenache-1 worker[437759]: [debug] [pid:437759] Job 17167435 from openqa.suse.de finished - reason: skipped
Mar 27 03:17:29 grenache-1 worker[437759]: [debug] [pid:437759] Cleaning up for next job
Mar 27 03:17:29 grenache-1 worker[437759]: [warn] [pid:437759] The average load (32.32 27.97 17.46) is exceeding the configured threshold of 25.

This should at least be mentioned as "reason". However, maybe it makes more sense to actually continue here despite the failing worker self-check.
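
To find affected workers in journal output, the load warning can be matched with a regular expression. The following Python sketch assumes the message format stays exactly as in the log excerpt above; the names `LOAD_WARNING` and `parse_load_warning` are made up for this example.

```python
import re
from typing import Optional, Tuple

NUMBER = r"\d+(?:\.\d+)?"
LOAD_WARNING = re.compile(
    rf"The average load \((?P<loads>{NUMBER} {NUMBER} {NUMBER})\) "
    rf"is exceeding the configured threshold of (?P<threshold>{NUMBER})"
)


def parse_load_warning(line: str) -> Optional[Tuple[Tuple[float, ...], float]]:
    """Return ((load1, load5, load15), threshold), or None if the line
    does not contain the load warning."""
    match = LOAD_WARNING.search(line)
    if match is None:
        return None
    loads = tuple(float(value) for value in match["loads"].split())
    return loads, float(match["threshold"])


line = (
    "Mar 27 03:17:29 grenache-1 worker[437759]: [warn] [pid:437759] "
    "The average load (32.32 27.97 17.46) is exceeding the configured threshold of 25."
)
print(parse_load_warning(line))  # ((32.32, 27.97, 17.46), 25.0)
```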

Actions #9

Updated by livdywan about 1 month ago

  • Copied to action #179563: Provide a reason when tests end up in the skipped state added
Actions #10

Updated by okurz about 1 month ago

  • Assignee changed from okurz to mkittler
Actions #11

Updated by livdywan about 1 month ago

There was also https://openqa.suse.de/tests/17035781#dependencies where the job ended up skipped

Actions #12

Updated by okurz about 1 month ago

  • Parent task set to #179572
Actions #13

Updated by mkittler about 1 month ago

  • Status changed from In Progress to Feedback
Actions #14

Updated by okurz about 1 month ago

  • Subject changed from Skipped dependencies with START_DIRECTLY_AFTER_TEST to Skipped dependencies with START_DIRECTLY_AFTER_TEST size:M
  • Description updated (diff)
  • Category changed from Support to Regressions/Crashes
Actions #15

Updated by mkittler about 1 month ago

  • Status changed from Feedback to Resolved

The PR has been merged and deployed so jobs shouldn't be skipped anymore like this.

Actions #16

Updated by openqa_review 28 days ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: oscap_bash_cis_hmc
https://openqa.suse.de/tests/17319219#step/oscap_security_guide_setup/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #17

Updated by mkittler 25 days ago · Edited

  • Status changed from Feedback to Resolved

This was a wrong carry-over (on https://openqa.suse.de/tests/17319219), as this job has no skipped tests at all.

The original job (from which the bugref was carried over, https://openqa.suse.de/tests/17184086) was executed only one day after the PR had been merged. So I would assume the change wasn't deployed yet at that time. (I also checked the recent job history and haven't found an instance of the skipping problem anymore.)

So for now I'm resolving this ticket again. I removed the bug references.

Actions #18

Updated by openqa_review about 14 hours ago

  • Status changed from Resolved to Feedback

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: oscap_ansible_cis_hmc
https://openqa.suse.de/tests/17618746#step/boot_to_desktop/1

To prevent further reminder comments one of the following options should be followed:

  1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
  2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
  3. The bugref in the openQA scenario is removed or replaced, e.g. label:wontfix:boo1234

Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.

Actions #19

Updated by mkittler 33 minutes ago

  • Status changed from Feedback to Resolved

Looks like a wrong carry-over again from a bugref of an old job (from before this ticket was resolved).

So I ran delete from comments where text ilike '%poo#179545%'; which deleted 13 comments.
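
A cautious variant of such a cleanup is to count the matching rows first and only then delete them inside a transaction. The sketch below uses Python's `sqlite3` with an in-memory database as a stand-in; the two-column table layout and the sample rows are invented for the example, and SQLite's `LIKE` (case-insensitive for ASCII by default) substitutes for PostgreSQL's `ilike`.

```python
import sqlite3

PATTERN = "%poo#179545%"

# In-memory stand-in database; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO comments (text) VALUES (?)",
    [("poo#179545",), ("label:linked POO#179545 skipped",), ("poo#179563",)],
)

# Dry run: count what the pattern would hit before deleting anything.
(to_delete,) = conn.execute(
    "SELECT COUNT(*) FROM comments WHERE text LIKE ?", (PATTERN,)
).fetchone()

# The connection as context manager commits on success and rolls back on error.
with conn:
    deleted = conn.execute(
        "DELETE FROM comments WHERE text LIKE ?", (PATTERN,)
    ).rowcount

remaining = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
print(to_delete, deleted, remaining)  # 2 2 1
```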
