action #176418
closed coordination #102915: [saga][epic] Automated classification of failures
QA (public) - coordination #94105: [epic] Use feedback from openqa-investigate to automatically inform on github pull requests, open tickets, weed out automatically failed tests
last_good_tests_and_build is not triggered even though matching worker instance seems to be free and 0 jobs running due to jobs as part of parallel clusters
Description
Observation
https://openqa.suse.de/tests/16623840 "sle-15-SP6-Full-QR-s390x-cc_audit_remote_server:investigate:last_good_tests_and_build:2568d56c0376bcc652e9b26c62dd13547e66715d+117.1@s390x-kvm" has worker class
s390-kvm,s390-kvm-sle12-mm,s390zl13,s390kvm103,zone-cc,region-prg,datacenter-dc7,location-prg2,worker33,cpu-x86_64,cpu-x86_64-v2,cpu-x86_64-v3
which should match https://openqa.suse.de/admin/workers/2655 which has worker class
s390-kvm,s390-kvm-sle12-mm,s390zl13,s390kvm103,zone-cc,region-prg,datacenter-dc7,location-prg2,worker33,cpu-x86_64,cpu-x86_64-v2,cpu-x86_64-v3
and currently there are no jobs running. Still the job isn't picked up.
From openqa_scheduler log:
[2025-02-01T12:25:24.753219Z] [debug] [pid:8830] Need to schedule 2 parallel jobs for job 16623840 (with priority 150)
So there are two jobs: https://openqa.suse.de/tests/16623840 and https://openqa.suse.de/tests/16623841
The problem is that there is only one instance which has the s390kvm103 part of the WORKER_CLASS, https://openqa.suse.de/admin/workers/2655. last_good_tests_and_build deliberately triggers on that exact worker combination, so it's intended that this is just one instance. With two parallel jobs both restricted to that single instance, the cluster can never be scheduled. We could loosen that requirement for parallel clusters.
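The mismatch can be illustrated with a minimal Python sketch. It assumes that a worker is eligible only if it advertises every class the job requests, and that both jobs of the cluster end up with the same pinned WORKER_CLASS. The helper names, the shortened class lists and the second worker are hypothetical, and the real scheduler logic is more involved.

```python
def eligible_workers(required_classes, workers):
    """Return names of workers advertising every required class."""
    required = set(required_classes.split(","))
    return [name for name, classes in workers.items()
            if required <= set(classes.split(","))]


def cluster_can_start(required_classes, workers, cluster_size):
    """A parallel cluster needs one eligible worker per job at the same time."""
    return len(eligible_workers(required_classes, workers)) >= cluster_size


# Shortened class lists for illustration; worker names are made up.
workers = {
    "worker33:1": "s390-kvm,s390zl13,s390kvm103,worker33",
    "worker33:2": "s390-kvm,s390zl13,s390kvm104,worker33",
}
pinned = "s390-kvm,s390zl13,s390kvm103,worker33"

print(eligible_workers(pinned, workers))      # only the s390kvm103 instance
print(cluster_can_start(pinned, workers, 2))  # two jobs, one eligible worker
print(cluster_can_start(pinned, workers, 1))  # a single job would fit
```

Under these assumptions the free worker does not help: the cluster of two jobs needs two eligible instances, and the pinned class leaves only one.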
Acceptance criteria
- AC1: last_good_tests_and_build tests can be executed with a sensible worker class selection, also for parallel cluster jobs
Suggestions
- We could check via the jobs API (which we already call) whether there are dependencies and, if so, skip that worker restriction. It would probably be possible to run only the actual test on that same worker class combination by setting the new WORKER_CLASS only for that test with the :$job_id feature, leaving any parallel sibling more relaxed. See https://open.qa/docs/#_spawning_single_new_jobs_jobs_post
- Start within https://github.com/os-autoinst/scripts/blob/386ec25180257858ca8f15303cec08a31ae4b23d/openqa-investigate#L68 and try to extend with the suggestion above
- As an alternative, consider adding an option in openqa-clone-job to allow copying values from vars.json for certain variables (but do it for each dependent job individually).
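The first suggestion could look roughly like the following Python sketch. It is not the openqa-investigate implementation, which is a shell script. It assumes `job` is the inner job object of the jobs API response, with `parents` and (as an assumption) `children` maps keyed by dependency type. The function names and the `scope` parameter are hypothetical; `scope` stands for whatever selector the scoped-setting feature of openqa-clone-job expects.

```python
def parallel_job_ids(job):
    """Collect IDs of all parallel parents and children of a job."""
    ids = []
    for relation in ("parents", "children"):
        ids.extend(job.get(relation, {}).get("Parallel", []))
    return ids


def worker_class_args(job, worker_class, scope):
    """Build the WORKER_CLASS argument to pass to openqa-clone-job.

    For a job in a parallel cluster, pin only the investigated test by
    scoping the setting, so that siblings keep their original class.
    Otherwise pin the worker class for the whole clone as before.
    """
    if parallel_job_ids(job):
        return [f"WORKER_CLASS:{scope}={worker_class}"]
    return [f"WORKER_CLASS={worker_class}"]


# Dependency data in the shape returned by the jobs API.
clustered = {"parents": {"Chained": [16622902], "Directly chained": [],
                         "Parallel": [16622906]}}
standalone = {"parents": {"Chained": [], "Directly chained": [], "Parallel": []}}

print(worker_class_args(clustered, "s390-kvm,s390kvm103", "some_test"))
print(worker_class_args(standalone, "s390-kvm,s390kvm103", "some_test"))
```

Whether the siblings should keep their original worker class or get a partially relaxed one is left open here; the sketch only shows where the decision would be made.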
Updated by tinita 28 days ago · Edited
From openqa_scheduler log:
[2025-02-01T12:25:24.753219Z] [debug] [pid:8830] Need to schedule 2 parallel jobs for job 16623840 (with priority 150)
So there are two jobs: https://openqa.suse.de/tests/16623840 and https://openqa.suse.de/tests/16623841
The problem is, there is only one instance which has the s390kvm103 part of the WORKER_CLASS, https://openqa.suse.de/admin/workers/2655
Updated by tinita 23 days ago
We could loosen that requirement for parallel clusters
Yeah, we could check via the jobs api (which we call already) if there are dependencies and then just skip that worker restriction:
% openqa-cli api --osd jobs/16622911
...
    "parents": {
        "Chained": [
            16622902
        ],
        "Directly chained": [],
        "Parallel": [
            16622906
        ]
    },
It would probably be possible to run only the actual test on that same worker by setting the new WORKER_CLASS only for that test with the :1 feature?
Updated by okurz 23 days ago
- Subject changed from last_good_tests_and_build is not triggered even though matching worker instance seems to be free and 0 jobs running to last_good_tests_and_build is not triggered even though matching worker instance seems to be free and 0 jobs running due to jobs as part of parallel clusters
- Description updated (diff)
- Category changed from Regressions/Crashes to Feature requests
We could not complete estimation and need to reconsider
Updated by tinita 18 days ago
- Status changed from Resolved to Workable
I just found this in the osd gru journal:
Feb 10 14:05:23 openqa openqa-gru[31894]: openqa-clone-job (83 /opt/os-autoinst-scripts/openqa-investigate): (openqa-clone-job --json-output --skip-chained-deps --max-depth 0 --parental-inheritance --within-instance https://openqa.suse.de/tests/16675956 _TRIGGER_JOB_DONE_HOOK=1 _GROUP_ID=0 BUILD= CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#99328f722c5266d87384cb7ffb78ab18bc7fba33 WORKER_CLASS:wsl2-main+systemd=qemu_x86_64,qemu_x86_64_staging,qemu_x86_64-large-mem,tap_secondary,windows11,wsl2,platform_intel,zone-cc,region-prg,datacenter-prg1,location-prg_office,openqaworker14,cpu-x86_64,cpu-x86_64-v2,cpu-x86_64-v3,cpu-x86_64-v4 TEST+=:investigate:last_good_tests_and_build:99328f722c5266d87384cb7ffb78ab18bc7fba33+3.76 OPENQA_INVESTIGATE_ORIGIN=https://openqa.suse.de/t16710107) stderr: >>>command-line argument 'WORKER_CLASS:wsl2-main+systemd=qemu_x86_64,qemu_x86_64_staging,qemu_x86_64-large-mem,tap_secondary,windows11,wsl2,platform_intel,zone-cc,region-prg,datacenter-prg1,location-prg_office,openqaworker14,cpu-x86_64,cpu-x86_64-v2,cpu-x86_64-v3,cpu-x86_64-v4' is no valid setting and will be ignored<<<
WORKER_CLASS:wsl2-main+systemd=...
It seems the + in the test name isn't expected:
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Script/CloneJob.pm#L42
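The effect can be reproduced with a small Python sketch. Both patterns below are hypothetical and not copied from CloneJob.pm: the strict one assumes the scope after the colon may only contain word characters, and the relaxed one shows one possible way to accept any characters up to the `=` while still recognising the `KEY+=value` append form seen in the log.

```python
import re

# Hypothetical patterns for KEY[:scope][+]=value, NOT the ones from CloneJob.pm.
STRICT = re.compile(r"([A-Z0-9_]+)(?::(\w+))?(\+)?=(.*)", re.DOTALL)
RELAXED = re.compile(r"([A-Z0-9_]+)(?::([^=]+?))?(\+)?=(.*)", re.DOTALL)


def parse_setting(arg, pattern):
    """Split a command-line setting into its parts, or return None if invalid."""
    match = pattern.fullmatch(arg)
    if match is None:
        return None
    key, scope, plus, value = match.groups()
    return {"key": key, "scope": scope, "append": plus is not None, "value": value}


arg = "WORKER_CLASS:wsl2-main+systemd=qemu_x86_64,wsl2"

print(parse_setting(arg, STRICT))   # rejected: "-" and "+" are not word characters
print(parse_setting(arg, RELAXED))  # scope is the full test name
print(parse_setting("TEST+=:investigate:abc", RELAXED))  # append form still works
```

A relaxed scope makes a test name ending in `+` ambiguous with the append form, so any real fix would need to decide how to handle that case.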
Updated by okurz 18 days ago
- Copied to action #176886: A "+" and other characters used in test names in $var are considered invalid in WORKER_CLASS:$var size:S added