Project

General

Profile

Actions

action #174601

closed

openqa-gru.service journal filled with openqa-trigger-bisect-jobs stack traces size:S

Added by nicksinger 2 months ago. Updated 13 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
Due date:
2025-02-21
% Done:

0%

Estimated time:

Description

Observation

While looking into https://progress.opensuse.org/issues/174580 and why openqa-gru failed today, I found a lot and repeated stack traces in gru's journal all looking similar to:

Dec 19 13:26:23 openqa openqa-gru[5066]: openqa-clone-job (81 /opt/os-autoinst-scripts/openqa-investigate): (openqa-clone-job --json-output --skip-chained-deps --max-depth 0 --parental-inheritance --within-instance https://openqa.suse.de/tests/16252243 TEST+=:investigate:last_good_tests:7c3e460816d9f4305b288674abdc15d295158b49 _TRIGGER_JOB_DONE_HOOK=1 _GROUP_ID=0 BUILD= CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#7c3e460816d9f4305b288674abdc15d295158b49 OPENQA_INVESTIGATE_ORIGIN=https://openqa.suse.de/t16252243) stderr: >>>Current job 16252243 will fail, because the repositories for the below updates are unavailable<<<

Dec 19 13:26:23 openqa openqa-gru[5066]: openqa-clone-job (81 /opt/os-autoinst-scripts/openqa-investigate): (openqa-clone-job --json-output --skip-chained-deps --max-depth 0 --parental-inheritance --within-instance https://openqa.suse.de/tests/16252243 TEST+=:investigate:last_good_tests:7c3e460816d9f4305b288674abdc15d295158b49 _TRIGGER_JOB_DONE_HOOK=1 _GROUP_ID=0 BUILD= CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#7c3e460816d9f4305b288674abdc15d295158b49 OPENQA_INVESTIGATE_ORIGIN=https://openqa.suse.de/t16252243) rc: 255 >>><<<

Dec 19 13:26:24 openqa openqa-gru[5129]: Current job 16252241 will fail, because the repositories for the below updates are unavailable
Dec 19 13:26:24 openqa openqa-gru[5129]: [
Dec 19 13:26:24 openqa openqa-gru[5129]:   "http://download.suse.de/ibs/SUSE:/Maintenance:/36747/SUSE_Updates_SLE-Module-Basesystem_15-SP5_x86_64/",
Dec 19 13:26:24 openqa openqa-gru[5129]: ] at /usr/share/openqa/script/../lib/OpenQA/Script/CloneJobSUSE.pm line 39.
Dec 19 13:26:24 openqa openqa-gru[5072]: Traceback (most recent call last):
Dec 19 13:26:24 openqa openqa-gru[5072]:   File "/opt/os-autoinst-scripts/openqa-trigger-bisect-jobs", line 322, in <module>
Dec 19 13:26:24 openqa openqa-gru[5072]:     main(parse_args())
Dec 19 13:26:24 openqa openqa-gru[5072]:   File "/opt/os-autoinst-scripts/openqa-trigger-bisect-jobs", line 304, in main
Dec 19 13:26:24 openqa openqa-gru[5072]:     args.dry_run,
Dec 19 13:26:24 openqa openqa-gru[5072]:   File "/opt/os-autoinst-scripts/openqa-trigger-bisect-jobs", line 149, in openqa_clone
Dec 19 13:26:24 openqa openqa-gru[5072]:     return call(["openqa-clone-job"] + default_opts + cmds + default_cmds, dry_run)
Dec 19 13:26:24 openqa openqa-gru[5072]:   File "/opt/os-autoinst-scripts/openqa-trigger-bisect-jobs", line 114, in call
Dec 19 13:26:24 openqa openqa-gru[5072]:     (["echo", "Simulating: "] if dry_run else []) + cmds
Dec 19 13:26:24 openqa openqa-gru[5072]:   File "/usr/lib64/python3.6/subprocess.py", line 356, in check_output
Dec 19 13:26:24 openqa openqa-gru[5072]:     **kwargs).stdout
Dec 19 13:26:24 openqa openqa-gru[5072]:   File "/usr/lib64/python3.6/subprocess.py", line 438, in run
Dec 19 13:26:24 openqa openqa-gru[5072]:     output=stdout, stderr=stderr)
Dec 19 13:26:24 openqa openqa-gru[5072]: subprocess.CalledProcessError: Command '['openqa-clone-job', '--skip-chained-deps', '--json-output', '--within-instance', 'https://openqa.suse.de/tests/16252241', 'SDK_TEST_REPOS=http://download.suse.de/ibs/SUSE:/Maintenance:/36728/SUSE_Updates_SLE-Module-Development-Tools_15-SP5_x86_64/,http://download.suse.de/ibs/SUSE:/Maintenance:/36797/SUSE_Updates_SLE-Module-Development-Tools_15-SP5_x86_64/,http://download.suse.de/ibs/SUSE:/Maintenance:/36821/SUSE_Updates_SLE-Module-Development-Tools_15-SP5_x86_64/', 'TEST=jeos-containers-podman:investigate:bisect_without_36475', 'OPENQA_INVESTIGATE_ORIGIN=https://openqa.suse.de/tests/16252241', 'MAINT_TEST_REPO=', '_GROUP=0']' returned non-zero exit status 255.

Dec 19 13:26:24 openqa openqa-gru[5158]: openqa-clone-job (81 /opt/os-autoinst-scripts/openqa-investigate): (openqa-clone-job --json-output --skip-chained-deps --max-depth 0 --parental-inheritance --within-instance https://openqa.suse.de/tests/16205104 TEST+=:investigate:last_good_build:20241216-1 _TRIGGER_JOB_DONE_HOOK=1 _GROUP_ID=0 BUILD= OPENQA_INVESTIGATE_ORIGIN=https://openqa.suse.de/t16252243) stderr: >>>Current job 16205104 will fail, because the repositories for the below updates are unavailable<<<

Dec 19 13:26:24 openqa openqa-gru[5158]: openqa-clone-job (81 /opt/os-autoinst-scripts/openqa-investigate): (openqa-clone-job --json-output --skip-chained-deps --max-depth 0 --parental-inheritance --within-instance https://openqa.suse.de/tests/16205104 TEST+=:investigate:last_good_build:20241216-1 _TRIGGER_JOB_DONE_HOOK=1 _GROUP_ID=0 BUILD= OPENQA_INVESTIGATE_ORIGIN=https://openqa.suse.de/t16252243) rc: 255 >>><<<

Dec 19 13:26:26 openqa openqa-gru[5248]: openqa-clone-job (81 /opt/os-autoinst-scripts/openqa-investigate): (openqa-clone-job --json-output --skip-chained-deps --max-depth 0 --parental-inheritance --within-instance https://openqa.suse.de/tests/16205104 TEST+=:investigate:last_good_tests_and_build:7c3e460816d9f4305b288674abdc15d295158b49+20241216-1 _TRIGGER_JOB_DONE_HOOK=1 _GROUP_ID=0 BUILD= CASEDIR=https://github.com/os-autoinst/os-autoinst-distri-opensuse.git#7c3e460816d9f4305b288674abdc15d295158b49 WORKER_CLASS=svirt-xen,openqaw5-xen,zone-cc,region-prg,datacenter-dc7,location-prg2,worker35,cpu-x86_64,cpu-x86_64-v2,cpu-x86_64-v3 OPENQA_INVESTIGATE_ORIGIN=https://openqa.suse.de/t16252243) stderr: >>>Current job 16205104 will fail, because the repositories for the below updates are unavailable<<<

The first time I see this in logs happening is Dec 19 09:55:21 (I assume this is UTC as it is directly from journalctl on OSD). I'm not sure if this is causing the gru-service to fail but it is certainly something to look into.

Acceptance Criteria

  • AC1: No Python stacktraces in openqa-gru service logs for known errors
  • AC2: openQA users have access to the details from openqa-clone-job, e.g. in the openQA job comment written by openqa-trigger-bisect-jobs

Suggestions

  • Confirm what is causing Python stacktraces to show up in log files
  • Redirect logs to /dev/null
  • Catch the error and propagate it to openqa investigate and write a comment on the job
  • Be aware that we have openqa-investigate (bash) and openqa-trigger-bisect-jobs (python) both calling openqa-clone-job which in both cases fail the same way
  • We need to find a way to propagate the information in both differing implementations with focus on preventing the python stack trace which is only coming from openqa-trigger-bisect-jobs. In the worst case don't fix the bash script as the python stacktrace comes from the python script

Related issues 1 (0 open1 closed)

Copied from openQA Infrastructure (public) - action #174580: [FIRING:1] Failed systemd services alertResolvednicksinger2024-11-19

Actions
Actions #1

Updated by nicksinger 2 months ago

  • Copied from action #174580: [FIRING:1] Failed systemd services alert added
Actions #2

Updated by jbaier_cz 2 months ago

Hmm.. is that a combination of investigation jobs which are cloning a failed test and the fail-early feature of CloneJobSUSE which looks if all IBS folders are available and fails the cloning process if not?

Actions #3

Updated by okurz about 2 months ago

  • Target version changed from Ready to Tools - Next
Actions #4

Updated by okurz about 1 month ago

  • Target version changed from Tools - Next to Ready
Actions #5

Updated by livdywan 26 days ago

  • Subject changed from openqa-gru.service journal filled with openqa-trigger-bisect-jobs stack traces to openqa-gru.service journal filled with openqa-trigger-bisect-jobs stack traces size:S
  • Description updated (diff)
Actions #6

Updated by livdywan 26 days ago

  • Status changed from New to Workable
Actions #7

Updated by ybonatakis 25 days ago

  • Status changed from Workable to In Progress
  • Assignee set to ybonatakis
Actions #8

Updated by openqa_review 24 days ago

  • Due date set to 2025-02-21

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by livdywan 24 days ago

What's the current state of this?

  • Did you run the unit test?
  • Which jobs if any did you look at to reproduce it?
  • What else did you look into?
Actions #10

Updated by ybonatakis 24 days ago · Edited

livdywan wrote in #note-9:

What's the current state of this?

  • Did you run the unit test?

No yet

  • Which jobs if any did you look at to reproduce it?
Random choices among others
 1551  ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/16684753
 1552  ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/16079005
 1553  ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/16365265
 1554  ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/15839520
 1555  ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/16081737
 1556  ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/15873161
 1557  ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/5275480
 1558  ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/15839456
 1559  ./openqa-trigger-bisect-jobs -vvvvv --dry-run --url https://openqa.suse.de/tests/16137625
  • What else did you look into?

Some not related stuff. there was a chat on slack. I am told to focus on ./openqa-trigger-bisect-jobs.
I looked into the osd logs and checked subprocess module (also wrote about it on slack). I tried to run debugger and that's all so far

Actions #11

Updated by okurz 24 days ago · Edited

your quoting in the above comment is broken. Please fix that DONE

Regarding the testing approach I recommend to first get tests/test_trigger_bisect_jobs.py to run in your environment, then make it fail by adding according mocked calls of openqa-clone-job returning the error string and triggering the unhandled stack trace, then catch the exception in openqa-trigger-bisect-jobs

Actions #13

Updated by ybonatakis 20 days ago

I have added a third commit with the changes following comments suggestions. The test only makes sure that in raising Exception exits and the logs have the logged error. I will have to make a manual test tomorrow and give it for reviews

Actions #14

Updated by ybonatakis 19 days ago

Another commit (draft) https://github.com/os-autoinst/scripts/pull/376
I tried to get the feedback from the unblock meeting in this one, without ignore the comments from https://github.com/os-autoinst/scripts/pull/375.

The PR as it is now is missing the openqa_comment (AC2). The main reason is because it makes the test fail in the current logic.

Actions #15

Updated by tinita 18 days ago

  • Description updated (diff)
Actions #16

Updated by okurz 17 days ago

https://github.com/os-autoinst/scripts/pull/377 was merged with changes from ybonatakis included. You can now focus on AC2. Please use TDD and feel welcome to show us the failing test code before diving into the implementation

Actions #17

Updated by ybonatakis 17 days ago

is it right to assume that the AC2 occurs only on because the repositories for the below updates are unavailable?

Actions #18

Updated by tinita 17 days ago

ybonatakis wrote in #note-17:

is it right to assume that the AC2 occurs only on because the repositories for the below updates are unavailable?

yes

Actions #20

Updated by ybonatakis 14 days ago

  • Status changed from In Progress to Feedback

ybonatakis wrote in #note-19:

https://github.com/os-autoinst/scripts/pull/378 (not-ready) for AC2

ready for review

Actions #21

Updated by ybonatakis 13 days ago

  • Status changed from Feedback to Resolved

writeup:

Modify the call. check_output runs with check parameter as True. With check enabled, a non-zero exit code will raise an Exception.
Tina replaced check_output with run and separating the output for better handling. call will always return the stdout or an Exception.

The underline problem solved by catching the exception in main, where openqa_clone is called. The Exception handles a known error[0] with exiting the script
and by raising another stack trace for all the other unknown ones.

Completed the task with sending a comment to the investigated job before exiting the known issue

Tests are enhanced to cover all the above

[0] https://github.com/os-autoinst/openQA/blob/6fbd2ed11e352b22101746bd00c21086fbf0300b/lib/OpenQA/Script/CloneJobSUSE.pm#L39

Actions

Also available in: Atom PDF