action #179545
closedcoordination #154768: [saga][epic][ux] State-of-art user experience for openQA
coordination #179572: [epic] Improved test reviewer user experience - job dependencies and status
Skipped dependencies with START_DIRECTLY_AFTER_TEST size:M
Description
Observation
We noticed that jobs using START_DIRECTLY_AFTER_TEST are skipped.
Here are a few examples of the behavior. We can see that the system was installed and the installation job finished fine, but all jobs depending on it are skipped.
x86_64: https://openqa.suse.de/tests/17170472#dependencies
aarch64: https://openqa.suse.de/tests/17169747#dependencies, https://openqa.suse.de/tests/17106428#dependencies
PowerVM example: https://openqa.suse.de/tests/17138135#dependencies
- It started roughly one week ago
- Restarting the installation job doesn't help much
- It doesn't matter whether the machine is in the cc zone or outside of it
- There were no related job group configuration changes
Unfortunately, this behavior blocks baremetal testing.
Steps to reproduce
- Go to https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Online&machine=ipmi-kernel-rt&test=ltp_kvm&version=15-SP7, find passed+skipped jobs
Acceptance Criteria
- AC1: Jobs in https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Online&machine=ipmi-kernel-rt&test=ltp_kvm&version=15-SP7 are consistently not skipped
- AC2: Jobs are still not executed if the worker load is too high
Suggestions
- Consider the worker load-threshold. This shouldn't make jobs end up as skipped
- directly chained jobs should just wait until the load has settled
- As an alternative, distribute the worker instances away from grenache, which is prone to reporting too high a load due to how KVM@PowerNV works
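The suggestion that directly chained jobs should wait until the load has settled could look roughly like the sketch below. This is not the openQA worker code (the worker is not written in Python); `wait_for_load`, its parameters and the choice to compare only the 1-minute load average are assumptions made for illustration.

```python
import os
import time
from typing import Callable, Optional, Sequence


def wait_for_load(
    threshold: float,
    get_load: Optional[Callable[[], Sequence[float]]] = None,
    poll_seconds: float = 10.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until the load is at or below the threshold.

    Returns True once the load has settled and False if it is still too
    high after max_polls checks. Assumption: only the 1-minute average
    (first value of the triple) is compared against the threshold.
    """
    if get_load is None:
        get_load = os.getloadavg  # Unix only
    for _ in range(max_polls):
        if get_load()[0] <= threshold:
            return True
        # Load too high: keep the chained job queued and check again later
        # instead of marking it as skipped.
        sleep(poll_seconds)
    return False
```

With such a loop the remaining jobs of a directly chained cluster would stay queued while the load is too high, which also keeps AC2 intact: nothing is executed until the check passes.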
Updated by pcervinka about 1 month ago
- Description updated (diff)
- Priority changed from Normal to High
Updated by ybonatakis about 1 month ago
A couple of errors in the minion job https://openqa.suse.de/minion/jobs?id=15036989 (if this is the one)
Updated by okurz about 1 month ago
- Tags set to reactive work
- Project changed from openQA Tests (public) to openQA Project (public)
- Category changed from Bugs in existing tests to Support
- Status changed from New to In Progress
- Assignee set to okurz
- Target version set to Ready
ybonatakis wrote in #note-2:
A couple of errors in the minion job https://openqa.suse.de/minion/jobs?id=15036989 (if this is the one)
https://openqa.suse.de/minion/jobs?id=15036989 does show errors like "START_AFTER_TEST=gnome@64bit not found - check for dependency typos and dependency cycles", but that means that the corresponding dependencies are not created at all. In the aforementioned openQA jobs, however, the dependencies are there; the jobs are merely skipped.
From the log files when looking into the parent https://openqa.suse.de/tests/17170252#dependencies and one child https://openqa.suse.de/tests/17170469 I see
openqa:/var/log # grep '\(17170252\|17170469\)' openqa_scheduler openqa_gru
openqa_scheduler:[2025-03-27T05:27:46.366179Z] [debug] [pid:19934] Need to schedule 1 parallel jobs for job 17170252 (with priority 50)
openqa_scheduler:[2025-03-27T05:27:46.440076Z] [debug] [pid:19934] [Job#17170252] Prepare for being processed by worker 4033
openqa_scheduler:[2025-03-27T05:27:46.575138Z] [debug] [pid:19934] [Job#17170469] Prepare for being processed by worker 4033
openqa_scheduler:[2025-03-27T05:27:47.240665Z] [debug] [pid:19934] Sent job(s) '17170469, 17170474, 17170471, 17170470, 17170473, 17170252, 17170472' to worker '4033'
openqa_scheduler:[2025-03-27T05:27:50.222853Z] [debug] [pid:19934] Allocated: { job => 17170469, worker => 4033 }
openqa_scheduler:[2025-03-27T05:27:50.223322Z] [debug] [pid:19934] Allocated: { job => 17170252, worker => 4033 }
that all looks ok
Updated by okurz about 1 month ago
https://openqa.suse.de/admin/auditlog does not have any hit for 17170469
But I assume ybonatakis is on the right track. https://openqa.suse.de/minion/jobs?id=15036989 says
- error_messages:
- START_DIRECTLY_AFTER_TEST=ay_prepare_baremetal@ipmi-kernel-rt not found - check
for dependency typos and dependency cycles
job_id: 17170469
and ay_prepare_baremetal is not there. https://openqa.suse.de/admin/productlog?id=2737223 also shows those error messages.
So there might be a problem due to "START_DIRECTLY_AFTER_TEST=ay_prepare_baremetal@ipmi-kernel-rt not found - check for dependency typos and dependency cycles". Do you know about ay_prepare_baremetal? According to https://openqa.suse.de/tests?match=ay_prepare_baremetal the last successful run on another machine was 2025-02-28 but no record of ay_prepare_baremetal@ipmi-kernel-rt at all. There is no recent change that I know of in the scheduling algorithms
Updated by pcervinka about 1 month ago
There were no changes in the related setup; we use START_DIRECTLY_AFTER_TEST=prepare_baremetal,ay_prepare_baremetal all the time and it worked fine.
Moreover, the skipped kdump test https://openqa.suse.de/tests/17167435#dependencies has only START_DIRECTLY_AFTER_TEST=prepare_baremetal.
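The "not found" message quoted above is consistent with a lookup along the following lines. This is only a sketch to make the discussion concrete, not openQA's scheduling code: `find_missing_parents` is a hypothetical helper, and the rule that a name without `@machine` falls back to the job's own machine is an assumption inferred from the error message.

```python
from typing import List, Set, Tuple


def find_missing_parents(
    setting_value: str,
    machine: str,
    scheduled: Set[Tuple[str, str]],
) -> List[str]:
    """Return the parents named in a dependency setting that are not scheduled.

    setting_value: comma-separated test names, optionally as test@machine.
    machine: machine of the job carrying the setting (assumed default).
    scheduled: (test, machine) pairs that exist in the scheduled product.
    """
    missing = []
    for name in (part.strip() for part in setting_value.split(",")):
        if not name:
            continue
        test, _, explicit_machine = name.partition("@")
        key = (test, explicit_machine or machine)
        if key not in scheduled:
            missing.append(f"{key[0]}@{key[1]}")
    return missing
```

Under these assumptions a job with START_DIRECTLY_AFTER_TEST=prepare_baremetal,ay_prepare_baremetal on ipmi-kernel-rt would report only ay_prepare_baremetal@ipmi-kernel-rt as missing while the dependency on prepare_baremetal is still found.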
Updated by pcervinka about 1 month ago
okurz wrote in #note-4:
So there might be a problem due to "START_DIRECTLY_AFTER_TEST=ay_prepare_baremetal@ipmi-kernel-rt not found - check for dependency typos and dependency cycles". Do you know about ay_prepare_baremetal? According to https://openqa.suse.de/tests?match=ay_prepare_baremetal the last successful run on another machine was 2025-02-28 but no record of ay_prepare_baremetal@ipmi-kernel-rt at all. There is no recent change that I know of in the scheduling algorithms
ay_prepare_baremetal is used only for QR validation, which is not scheduled often; the last build was about a month ago.
Updated by pcervinka about 1 month ago
Same issue is also on unarmed aarch64 machine https://openqa.suse.de/tests/17169415#dependencies
The system installation and the ltp installation were done and the rest was skipped.
Also see powervm installation on micro: https://openqa.suse.de/tests/17138135#dependencies, but after job restart it passed https://openqa.suse.de/tests/17162132#dependencies.
Updated by mkittler about 1 month ago
If you like you can assign this ticket to me. I found the "problem":
Mar 27 03:17:29 grenache-1 worker[437759]: [debug] [pid:437759] Job 17167439 from openqa.suse.de finished - reason: skipped
Mar 27 03:17:29 grenache-1 worker[437759]: [debug] [pid:437759] Cleaning up for next job
Mar 27 03:17:29 grenache-1 worker[437759]: [warn] [pid:437759] The average load (32.32 27.97 17.46) is exceeding the configured threshold of 25. The worker will temporarily not accept new jobs until the load is low>
Mar 27 03:17:29 grenache-1 worker[437759]: [info] [pid:437759] Skipping job 17167435 from queue because worker is broken (The average load (32.32 27.97 17.46) is exceeding the configured threshold of 25. The worker>
Mar 27 03:17:29 grenache-1 worker[437759]: [debug] [pid:437759] Stopping job 17167435 from openqa.suse.de: ? - reason: skipped
Mar 27 03:17:29 grenache-1 worker[437759]: [debug] [pid:437759] REST-API call: POST "http://openqa.suse.de/api/v1/jobs/17167435/set_done?result=skipped&worker_id=4033"
Mar 27 03:17:29 grenache-1 worker[437759]: [debug] [pid:437759] Job 17167435 from openqa.suse.de finished - reason: skipped
Mar 27 03:17:29 grenache-1 worker[437759]: [debug] [pid:437759] Cleaning up for next job
Mar 27 03:17:29 grenache-1 worker[437759]: [warn] [pid:437759] The average load (32.32 27.97 17.46) is exceeding the configured threshold of 25.
This should at least be mentioned as "reason". However, maybe it makes more sense to actually continue here despite the failing worker self-check.
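To find further occurrences in worker logs, the warning can be parsed with a small script like the one below. It is a sketch written against the message format shown in the excerpt above; the names `parse_load_warning` and `LoadWarning` are made up here, and reading the three numbers as 1-, 5- and 15-minute averages assumes the usual load average order.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

NUMBER = r"(\d+(?:\.\d+)?)"
PATTERN = re.compile(
    rf"The average load \({NUMBER} {NUMBER} {NUMBER}\) is exceeding "
    rf"the configured threshold of {NUMBER}"
)
# Assumption: values are printed in the usual 1/5/15-minute order.
WINDOWS = ("1min", "5min", "15min")


class LoadWarning(NamedTuple):
    load: Tuple[float, float, float]
    threshold: float

    def exceeded_windows(self) -> List[str]:
        """Names of the averaging windows that are above the threshold."""
        return [w for w, v in zip(WINDOWS, self.load) if v > self.threshold]


def parse_load_warning(line: str) -> Optional[LoadWarning]:
    """Extract load values and threshold from a worker log line, if present."""
    match = PATTERN.search(line)
    if match is None:
        return None
    one, five, fifteen, threshold = (float(g) for g in match.groups())
    return LoadWarning((one, five, fifteen), threshold)
```

For the lines above this yields the load values 32.32, 27.97 and 17.46 against a threshold of 25, so the first two values are above the threshold and the third is below it.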
Updated by livdywan about 1 month ago
- Copied to action #179563: Provide a reason when tests end up in the skipped state added
Updated by livdywan about 1 month ago
There was also https://openqa.suse.de/tests/17035781#dependencies where the job ended up skipped
Updated by mkittler about 1 month ago
- Status changed from In Progress to Feedback
Updated by okurz about 1 month ago
- Subject changed from Skipped dependencies with START_DIRECTLY_AFTER_TEST to Skipped dependencies with START_DIRECTLY_AFTER_TEST size:M
- Description updated (diff)
- Category changed from Support to Regressions/Crashes
Updated by mkittler about 1 month ago
- Status changed from Feedback to Resolved
The PR has been merged and deployed so jobs shouldn't be skipped anymore like this.
Updated by openqa_review 28 days ago
- Status changed from Resolved to Feedback
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: oscap_bash_cis_hmc
https://openqa.suse.de/tests/17319219#step/oscap_security_guide_setup/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by mkittler 25 days ago · Edited
- Status changed from Feedback to Resolved
This was a wrong carry over (on https://openqa.suse.de/tests/17319219) as this job has no skipped tests at all.
The original job (where the bugref was carried over, https://openqa.suse.de/tests/17184086) was only executed one day after the PR had been merged. So I would assume the change wasn't deployed at this time. (I also checked the recent job history and haven't found an instance of the skipping problem anymore.)
So for now I'm resolving this ticket again. I removed the bug references.
Updated by openqa_review about 14 hours ago
- Status changed from Resolved to Feedback
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: oscap_ansible_cis_hmc
https://openqa.suse.de/tests/17618746#step/boot_to_desktop/1
To prevent further reminder comments one of the following options should be followed:
- The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
- The openQA job group is moved to "Released" or "EOL" (End-of-Life)
- The bugref in the openQA scenario is removed or replaced, e.g.
label:wontfix:boo1234
Expect the next reminder at the earliest in 28 days if nothing changes in this ticket.
Updated by mkittler 33 minutes ago
- Status changed from Feedback to Resolved
Looks like a wrong carry over again from a bugref of an old job (from before this ticket was resolved).
So I ran delete from comments where text ilike '%poo#179545%'; which deleted 13 comments.
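For reference, the same kind of cleanup can be tried out safely against a throwaway database. The sketch below uses Python's sqlite3 module with an in-memory table instead of the production PostgreSQL database; the table layout is a minimal stand-in, `delete_bugref_comments` is a made-up helper, and SQLite's LIKE (case-insensitive for ASCII) takes the place of PostgreSQL's ilike.

```python
import sqlite3


def delete_bugref_comments(conn: sqlite3.Connection, bugref: str) -> int:
    """Delete all comments containing the bugref and return how many were removed."""
    pattern = f"%{bugref}%"
    with conn:  # commits on success, rolls back if an exception is raised
        cursor = conn.execute(
            "DELETE FROM comments WHERE text LIKE ?", (pattern,)
        )
        return cursor.rowcount


# Throwaway demo data; the real comments table has more columns.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, text TEXT)")
demo.executemany(
    "INSERT INTO comments (text) VALUES (?)",
    [("poo#179545",), ("see POO#179545 (carried over)",), ("poo#179563",)],
)
deleted = delete_bugref_comments(demo, "poo#179545")
remaining = demo.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
```

Passing the pattern as a bound parameter avoids quoting problems with the # and % characters in the bugref.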