action #174601
closedopenqa-gru.service journal filled with openqa-trigger-bisect-jobs stack traces size:S
0%
Description
Observation¶
While looking into https://progress.opensuse.org/issues/174580 and why openqa-gru failed today, I found a lot and repeated stack traces in gru's journal all looking similar to:
Dec 19 13:26:23 openqa openqa-gru[5066]: openqa-clone-job (81 /opt/os-autoinst-scripts/openqa-investigate): (openqa-clone-job --json-output --skip-chained-deps --max-depth 0 --parental-inheritance --within-instance https://openqa.suse.de/tests/16252243 TEST+=:investigate:last_good_tests:7c3e460816d9f4305b288674abdc15d295158b49 _TRIGGER_JOB_DONE_HOOK=1 _GROUP_ID=0 BUILD= CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#7c3e460816d9f4305b288674abdc15d295158b49 OPENQA_INVESTIGATE_ORIGIN=https://openqa.suse.de/t16252243) stderr: >>>Current job 16252243 will fail, because the repositories for the below updates are unavailable<<<
Dec 19 13:26:23 openqa openqa-gru[5066]: openqa-clone-job (81 /opt/os-autoinst-scripts/openqa-investigate): (openqa-clone-job --json-output --skip-chained-deps --max-depth 0 --parental-inheritance --within-instance https://openqa.suse.de/tests/16252243 TEST+=:investigate:last_good_tests:7c3e460816d9f4305b288674abdc15d295158b49 _TRIGGER_JOB_DONE_HOOK=1 _GROUP_ID=0 BUILD= CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#7c3e460816d9f4305b288674abdc15d295158b49 OPENQA_INVESTIGATE_ORIGIN=https://openqa.suse.de/t16252243) rc: 255 >>><<<
Dec 19 13:26:24 openqa openqa-gru[5129]: Current job 16252241 will fail, because the repositories for the below updates are unavailable
Dec 19 13:26:24 openqa openqa-gru[5129]: [
Dec 19 13:26:24 openqa openqa-gru[5129]: "http://download.suse.de/ibs/SUSE:/Maintenance:/36747/SUSE_Updates_SLE-Module-Basesystem_15-SP5_x86_64/",
Dec 19 13:26:24 openqa openqa-gru[5129]: ] at /usr/share/openqa/script/../lib/OpenQA/Script/CloneJobSUSE.pm line 39.
Dec 19 13:26:24 openqa openqa-gru[5072]: Traceback (most recent call last):
Dec 19 13:26:24 openqa openqa-gru[5072]: File "/opt/os-autoinst-scripts/openqa-trigger-bisect-jobs", line 322, in <module>
Dec 19 13:26:24 openqa openqa-gru[5072]: main(parse_args())
Dec 19 13:26:24 openqa openqa-gru[5072]: File "/opt/os-autoinst-scripts/openqa-trigger-bisect-jobs", line 304, in main
Dec 19 13:26:24 openqa openqa-gru[5072]: args.dry_run,
Dec 19 13:26:24 openqa openqa-gru[5072]: File "/opt/os-autoinst-scripts/openqa-trigger-bisect-jobs", line 149, in openqa_clone
Dec 19 13:26:24 openqa openqa-gru[5072]: return call(["openqa-clone-job"] + default_opts + cmds + default_cmds, dry_run)
Dec 19 13:26:24 openqa openqa-gru[5072]: File "/opt/os-autoinst-scripts/openqa-trigger-bisect-jobs", line 114, in call
Dec 19 13:26:24 openqa openqa-gru[5072]: (["echo", "Simulating: "] if dry_run else []) + cmds
Dec 19 13:26:24 openqa openqa-gru[5072]: File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output
Dec 19 13:26:24 openqa openqa-gru[5072]: **kwargs).stdout
Dec 19 13:26:24 openqa openqa-gru[5072]: File "/usr/lib64/python3.6/subprocess.py", line 438, in run
Dec 19 13:26:24 openqa openqa-gru[5072]: output=stdout, stderr=stderr)
Dec 19 13:26:24 openqa openqa-gru[5072]: subprocess.CalledProcessError: Command '['openqa-clone-job', '--skip-chained-deps', '--json-output', '--within-instance', 'https://openqa.suse.de/tests/16252241', 'SDK_TEST_REPOS=http://download.suse.de/ibs/SUSE:/Maintenance:/36728/SUSE_Updates_SLE-Module-Development-Tools_15-SP5_x86_64/,http://download.suse.de/ibs/SUSE:/Maintenance:/36797/SUSE_Updates_SLE-Module-Development-Tools_15-SP5_x86_64/,http://download.suse.de/ibs/SUSE:/Maintenance:/36821/SUSE_Updates_SLE-Module-Development-Tools_15-SP5_x86_64/', 'TEST=jeos-containers-podman:investigate:bisect_without_36475', 'OPENQA_INVESTIGATE_ORIGIN=https://openqa.suse.de/tests/16252241', 'MAINT_TEST_REPO=', '_GROUP=0']' returned non-zero exit status 255.
Dec 19 13:26:24 openqa openqa-gru[5158]: openqa-clone-job (81 /opt/os-autoinst-scripts/openqa-investigate): (openqa-clone-job --json-output --skip-chained-deps --max-depth 0 --parental-inheritance --within-instance https://openqa.suse.de/tests/16205104 TEST+=:investigate:last_good_build:20241216-1 _TRIGGER_JOB_DONE_HOOK=1 _GROUP_ID=0 BUILD= OPENQA_INVESTIGATE_ORIGIN=https://openqa.suse.de/t16252243) stderr: >>>Current job 16205104 will fail, because the repositories for the below updates are unavailable<<<
Dec 19 13:26:24 openqa openqa-gru[5158]: openqa-clone-job (81 /opt/os-autoinst-scripts/openqa-investigate): (openqa-clone-job --json-output --skip-chained-deps --max-depth 0 --parental-inheritance --within-instance https://openqa.suse.de/tests/16205104 TEST+=:investigate:last_good_build:20241216-1 _TRIGGER_JOB_DONE_HOOK=1 _GROUP_ID=0 BUILD= OPENQA_INVESTIGATE_ORIGIN=https://openqa.suse.de/t16252243) rc: 255 >>><<<
Dec 19 13:26:26 openqa openqa-gru[5248]: openqa-clone-job (81 /opt/os-autoinst-scripts/openqa-investigate): (openqa-clone-job --json-output --skip-chained-deps --max-depth 0 --parental-inheritance --within-instance https://openqa.suse.de/tests/16205104 TEST+=:investigate:last_good_tests_and_build:7c3e460816d9f4305b288674abdc15d295158b49+20241216-1 _TRIGGER_JOB_DONE_HOOK=1 _GROUP_ID=0 BUILD= CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#7c3e460816d9f4305b288674abdc15d295158b49 WORKER_CLASS=svirt-xen,openqaw5-xen,zone-cc,region-prg,datacenter-dc7,location-prg2,worker35,cpu-x86_64,cpu-x86_64-v2,cpu-x86_64-v3 OPENQA_INVESTIGATE_ORIGIN=https://openqa.suse.de/t16252243) stderr: >>>Current job 16205104 will fail, because the repositories for the below updates are unavailable<<<
The first time I see this in logs happening is Dec 19 09:55:21
(I assume this is UTC as it is directly from journalctl on OSD). I'm not sure if this is causing the gru-service to fail but it is certainly something to look into.
Acceptance Criteria¶
- AC1: No Python stacktraces in openqa-gru service logs for known errors
- AC2: openQA users have access to the details from openqa-clone-job, e.g. in the openQA job comment written by openqa-trigger-bisect-jobs
Suggestions¶
- Confirm what is causing Python stacktraces to show up in log files
- Redirect logs to /dev/null
- Catch the error and propagate it to openqa investigate and write a comment on the job
- Be aware that we have openqa-investigate (bash) and openqa-trigger-bisect-jobs (python) both calling openqa-clone-job which in both cases fail the same way
- We need to find a way to propagate the information in both differing implementations with focus on preventing the python stack trace which is only coming from openqa-trigger-bisect-jobs. In the worst case don't fix the bash script as the python stacktrace comes from the python script
Updated by nicksinger 2 months ago
- Copied from action #174580: [FIRING:1] Failed systemd services alert added
Updated by okurz about 2 months ago
- Target version changed from Ready to Tools - Next
Updated by okurz about 1 month ago
- Target version changed from Tools - Next to Ready
Updated by ybonatakis 25 days ago
- Status changed from Workable to In Progress
- Assignee set to ybonatakis
Updated by openqa_review 24 days ago
- Due date set to 2025-02-21
Setting due date based on mean cycle time of SUSE QE Tools
Updated by ybonatakis 24 days ago · Edited
livdywan wrote in #note-9:
What's the current state of this?
- Did you run the unit test?
No yet
- Which jobs if any did you look at to reproduce it?
Random choices among others
1551 ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/16684753
1552 ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/16079005
1553 ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/16365265
1554 ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/15839520
1555 ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/16081737
1556 ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/15873161
1557 ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/5275480
1558 ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/15839456
1559 ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/16137625
- What else did you look into?
Some not related stuff. there was a chat on slack. I am told to focus on ./openqa-trigger-bisect-jobs.
I looked into the osd logs and checked subprocess module (also wrote about it on slack). I tried to run debugger and that's all so far
Updated by okurz 24 days ago · Edited
your quoting in the above comment is broken. Please fix that DONE
Regarding the testing approach I recommend to first get tests/test_trigger_bisect_jobs.py to run in your environment, then make it fail by adding according mocked calls of openqa-clone-job returning the error string and triggering the unhandled stack trace, then catch the exception in openqa-trigger-bisect-jobs
Updated by ybonatakis 24 days ago
Updated by ybonatakis 20 days ago
I have added a third commit with the changes following comments suggestions. The test only makes sure that in raising Exception exits and the logs have the logged error. I will have to make a manual test tomorrow and give it for reviews
Updated by ybonatakis 19 days ago
Another commit (draft) https://github.com/os-autoinst/scripts/pull/376
I tried to get the feedback from the unblock meeting in this one, without ignore the comments from https://github.com/os-autoinst/scripts/pull/375.
The PR as it is now is missing the openqa_comment
(AC2). The main reason is because it makes the test fail in the current logic.
Updated by okurz 17 days ago
https://github.com/os-autoinst/scripts/pull/377 was merged with changes from ybonatakis included. You can now focus on AC2. Please use TDD and feel welcome to show us the failing test code before diving into the implementation
Updated by ybonatakis 17 days ago
is it right to assume that the AC2 occurs only on because the repositories for the below updates are unavailable
?
Updated by ybonatakis 14 days ago
https://github.com/os-autoinst/scripts/pull/378 (not-ready) for AC2
Updated by ybonatakis 14 days ago
- Status changed from In Progress to Feedback
ybonatakis wrote in #note-19:
https://github.com/os-autoinst/scripts/pull/378 (not-ready) for AC2
ready for review
Updated by ybonatakis 13 days ago
- Status changed from Feedback to Resolved
writeup:¶
Modify the call
. check_output runs with check
parameter as True. With check enabled, a non-zero exit code will raise an Exception.
Tina replaced check_output
with run
and separating the output for better handling. call
will always return the stdout or an Exception.
The underline problem solved by catching the exception in main, where openqa_clone
is called. The Exception handles a known error[0] with exiting the script
and by raising another stack trace for all the other unknown ones.
Completed the task with sending a comment to the investigated job before exiting the known issue
Tests are enhanced to cover all the above