action #99519
closedInvestigation jobs triggered but missing comment on original job size:M
Description
Observation¶
Similar to #96959 I found a job https://openqa.suse.de/tests/7261950/#comments that has only a (secondary) comment after completion of the retry job but no initial comment linking all investigation jobs which I assume had been triggered
geekotest wrote 2021-09-30 03:13:09 +0000
Investigate retry job: https://openqa.suse.de/t7268550 failed, likely not a sporadic test
OPENQA_INVESTIGATE_ORIGIN https://openqa.suse.de/t7261950
Use of uninitialized value $yast_pid in scalar chomp at sle/lib/y2_base.pm line 45.
y2_base::save_strace_gdb_output(skip_install_addons=HASH(0x1002f4a5c50)) called at sle/lib/y2_installbase.pm line 558
y2_installbase::post_fail_hook(skip_install_addons=HASH(0x1002f4a5c50)) called at /usr/lib/os-autoinst/basetest.pm line 331
Expected result¶
There should be a comment referencing up to four investigation jobs triggered
Suggestions¶
- Look into Apache logs e.g. errors posting a comments
grep 7261950 access_log | grep openqa-investigate
zgrep 7261950 access_log-20210930.xz | grep openqa-investigate
- Retry if comment can't be posted
Updated by okurz about 3 years ago
- Subject changed from Job hooks trigger investigate jobs for passed/soft-failed size:M to Investigation jobs triggered but missing comment on original job
- Priority changed from Normal to High
treating as "High" as there is only a single static job URL so far and we should act before loosing the reference.
Updated by livdywan about 3 years ago
- Subject changed from Investigation jobs triggered but missing comment on original job to Investigation jobs triggered but missing comment on original job size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by tinita about 3 years ago
- Status changed from Workable to In Progress
- Assignee set to tinita
Updated by tinita about 3 years ago
I did some database stats on osd and o3 which suggest that this happens quite often.
o3:
openqa=> select j.id, js.value from job_settings js join jobs j on js.job_id=j.id where key='OPENQA_INVESTIGATE_ORIGIN' and js.t_created >= '2021-09-30 08:00:00' and test like '%:investigate:retry' order by value;
id | value
---------+-------------------------------------
1947052 | http://openqa.opensuse.org/t1945661 <--
1946990 | http://openqa.opensuse.org/t1946900
1946998 | http://openqa.opensuse.org/t1946902
1946969 | http://openqa.opensuse.org/t1946913 <--
1947002 | http://openqa.opensuse.org/t1946994
1947006 | http://openqa.opensuse.org/t1946995
1947010 | http://openqa.opensuse.org/t1946996
1947014 | http://openqa.opensuse.org/t1946997
1947079 | http://openqa.opensuse.org/t1947049
1947054 | http://openqa.opensuse.org/t1947050
1947098 | http://openqa.opensuse.org/t1947059
(11 rows)
openqa=> select job_id from comments where t_created >= '2021-09-30 08:00:00' and text like 'Automatic investigation jobs%'
and text like '%:investigate:retry%' order by job_id;
job_id
---------
1946900
1946902
1946994
1946995
1946996
1946997
1947049
1947050
1947059
(9 rows)
Updated by tinita about 3 years ago
I needed to exclude the cloned jobs.
This is the query I came up with to list jobs which are retries and where the original job didn't get a comment:
select CONCAT('https://openqa.suse.de/t',js.job_id),CAST(substr(js.value, 25) as int) as jobid, c.job_id, substr(c.text, 0, 30) as comment_text from job_settings js join jobs j on js.job_id=j.id left join jobs jc on j.id=jc.clone_id left j
oin comments c on CAST(substr(js.value, 25) as int) =c.job_id where key='OPENQA_INVESTIGATE_ORIGIN' and js.t_created >= '2021-09-30 00:00:00' and j.test like '%:investigate:retry' and jc.clone_id is null and c.job_id is null order by jobid;
Note that you need to adjust the substr
for o3 because of the different url.
I picked one of those jobs in osd and ran openqa-investigate on it, and it posted a comment successfully. So it's not very likely that there is a problem with the logic of the code.
Updated by okurz about 3 years ago
tinita and me continued to investigate a bit further together. Interesting findings:
- There are logfiles /var/log/openqa_gru* which seem to contain the stdout of the gru minion service but neither stderr nor stdout of actual minion job processes. The stderr goes into the journal of the systemd service, see
journalctl -u openqa-gru
, the stdout of minion job processes is put into the "notes" of minion jobs which are visible in the "minion_jobs" database table as well as over the minion dashboard UI - There are warnings in the gru system journal about CloneJob.pm. Proposed fix in https://github.com/os-autoinst/scripts/pull/115
- In the case of success we found that minion jobs record the output of openqa-cli with an openQA comment id that is created, e.g. using SQL
select * from minion_jobs where args::TEXT like '%7280737%';
there is the row content"hook_result": {"id": 391557}
, in case of failure the "hook_result" is empty but there is no non-zero exit code recorded or anything
We have created https://github.com/os-autoinst/scripts/pull/114 for verbose output based on the "verbose" option and with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/597 we have set that option for all hook scripts on OSD.
Now we can monitor the journal of openqa-gru again and see what should appear in lines matching something like openqa-investigate.*out:
What I can see already is:
Sep 30 20:37:19 openqa openqa-gru[821]: 404 Not Found
Sep 30 20:40:11 openqa openqa-gru[821]: 404 Not Found
Sep 30 20:40:32 openqa openqa-gru[821]: 404 Not Found
Sep 30 20:41:04 openqa openqa-gru[821]: 404 Not Found
I think this comes from the actual openqa-cli command that tries to write a comment on jobs and fails.
EDIT: When calling a command openqa-cli api --osd -X POST jobs/1/comments text="test by okurz"
one gets
404 Not Found
{"error":"Job 1 does not exist","error_status":404}
with the first line being stderr and the second stdout. Now I wondered, if we see the line "404 Not Found" in the system journal of openqa-gru I should be able to search for "does not exist" in minion_jobs notes with select * from minion_jobs where notes::TEXT like '%does not exist%' order by started desc limit 1;
and I found
https://openqa.suse.de/minion/jobs?id=3060251
which has as "args": "7274792", so corresponding to the openQA job https://openqa.suse.de/tests/7274792 , which by chance, is failed, has no comment, so "needs investigation". So I assume openqa-investigate was triggered for that.
But in notes we have
{"error":"Job 7274554 does not exist","error_status":404}
Ok, in this example the triggering job is a ":retry" which had a chance to run only some hours after the original job and in this case a user actually deleted the original job. That should be fine.
Updated by tinita about 3 years ago
PR for saving the return code of the hook command: https://github.com/os-autoinst/openQA/pull/4256
It's not obvious from the minion dashboard if a command failed or not.
Updated by openqa_review about 3 years ago
- Due date set to 2021-10-15
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita about 3 years ago
I added some debugging code which echoes something to stderr right before the first clone, and before checking $out
before posting the comment.
For today I found two jobs on osd without comments, 7285740 and 7270784.
In the openqa-gru journal I can see the first debug entry, but not the second, meaning that the script finished before (killed by timeout for example) or simply did not get to this code path.
Updated by tinita about 3 years ago
PR https://github.com/os-autoinst/openQA/pull/4256 (log exit code in hook cmd) was merged.
With that we will hopefully gather more information, for example a timeout would result in an exit code of 124.
Updated by tinita almost 3 years ago
We can now see which hooks failed:
select id, args, notes->'hook_rc', notes->'hook_result', created, finished from minion_jobs where notes::TEXT like '%hook_rc%' and cast(notes->'hook_rc' as int) != 0 order by created;
There are several with rc=31744 which is 31744 << 8 ==
124, so it got a timeout.
created
That matches theand
finished` columns, as the timeout is currently set to 90s.
To get something in stderr, we should catch the TERM signal with trap
.
But IMHO we should also increase the timeout, and we need to create an alert of some kind when it still fails. Currently those failures are ignored
Updated by tinita almost 3 years ago
https://github.com/os-autoinst/openQA/pull/4274 Run hook script timeout in verbose mode
Updated by tinita almost 3 years ago
https://github.com/os-autoinst/openQA/pull/4274 was merged.
New PR:
https://github.com/os-autoinst/scripts/pull/117 Improve error handling
Updated by tinita almost 3 years ago
Short statistics about finalize_job_results minion jobs:
openqa=> select count(*) from minion_jobs where task = 'finalize_job_results' and cast(notes->'hook_rc' as int) = 0 ;
count
-------
1532
(1 row)
openqa=> select count(*) from minion_jobs where task = 'finalize_job_results' and cast(notes->'hook_rc' as int) != 0 ;
count
-------
51
(1 row)
openqa=> select avg(finished - started) from minion_jobs where task = 'finalize_job_results' and cast(notes->'hook_rc' as int) = 0 ;
avg
-----------------
00:00:19.716963
(1 row)
So the large majority of the hooks is fast enough.
The slowest job that was killed due to timeout took 8:29 minutes, and I'm actually wondering why it wasn't killed earlier as we have a timeout of 90s. Maybe my calculation of finished - started
is wrong.
Setting the timeout to 3 minutes should catch almost all occurrences.
Updated by tinita almost 3 years ago
Increase the timeout: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/601
Updated by tinita almost 3 years ago
- Status changed from In Progress to Feedback
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/601 was merged.
Today we have 9 hooks with non-zero exit, 7 of them timeouts, two of them exit code 1 (and 897 successful so far).
I cannot see related error message in the gru journal if I look for the timestamps of the failing hooks.
I found 7 jobs with missing comments in the original, however, the SQL is not good enough, and 4 of them are actually jobs that already have been deleted.
Not sure how I could improve this SQL to only get rows where the original job still exists:
select CONCAT('https://openqa.suse.de/t',js.job_id),CAST(substr(js.value, 25) as int) as jobid, c.id, c.job_id, substr(c.text, 0, 30) as comment_text from job_settings js join jobs j on js.job_id=j.id left join jobs jc on j.id=jc.clone_id left join comments c on CAST(substr(js.value, 25) as int) =c.job_id where key='OPENQA_INVESTIGATE_ORIGIN' and js.t_created >= '2021-10-08 00:00:00' and j.test like '%:investigate:retry' and jc.clone_id is null an
d c.job_id is null order by jobid;
We could increase the timeout even more, maybe 5 minutes?
See the following SQL to get a feeling for how long the hooks take:
select id, args, task, notes->'hook_rc', started, finished, finished-started from minion_jobs where started >= '2021-10-08 00:00:00' and cast(notes->'hook_rc' as int) = 0 order by finished-started
Updated by tinita almost 3 years ago
Increase the timeout to 10 minutes: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/604
Yesterday we still had 42 timeouts, today 16 so far.
edit 2021-10-13: MR was merged
Updated by mkittler almost 3 years ago
Note that I set the timeout (so low) because finalize jobs piled up on OSD. I hope this won't be the case again.
Updated by tinita almost 3 years ago
mkittler wrote:
Note that I set the timeout (so low) because finalize jobs piled up on OSD. I hope this won't be the case again.
Yeah, but we have an individual timeout for grep, which was the problematic command.
Updated by tinita almost 3 years ago
- Status changed from Feedback to Resolved
No timeouts since yesterday 10am, and also no jobs with missing comments since then.
openqa=> select id, args, notes->'hook_rc', finished-started from minion_jobs where task='finalize_job_results' and started >= '2021-10-13 00:00:00' and cast(notes->'hook_rc' as int) = 0 order by finished-started;
...
3175529 | [7396900, null] | 0 | 00:04:04.862248
3173802 | [7397004, null] | 0 | 00:04:06.029641
3186645 | [7409020, null] | 0 | 00:04:18.535297
3173800 | [7397181, null] | 0 | 00:04:37.506425
3187689 | [7410617, null] | 0 | 00:04:40.543086
3181702 | [7404522, null] | 0 | 00:05:09.212596
3186968 | [7409170, null] | 0 | 00:05:10.293948
3175831 | [7397221, null] | 0 | 00:05:14.800126
3171255 | [7371796, null] | 0 | 00:06:05.722159
3186500 | [7408450, null] | 0 | 00:08:16.458306
3186462 | [7408467, null] | 0 | 00:09:01.491331
(1247 rows)
openqa=> select avg(finished - started) from minion_jobs where task = 'finalize_job_results' and cast(notes->'hook_rc' as int) = 0 ;
avg
-----------------
00:00:37.170541