Project

General

Profile

Actions

action #99519

closed

Investigation jobs triggered but missing comment on original job size:M

Added by okurz over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
Due date:
2021-10-15
% Done:

0%

Estimated time:

Description

Observation

Similar to #96959 I found a job https://openqa.suse.de/tests/7261950/#comments that has only a (secondary) comment after completion of the retry job but no initial comment linking all investigation jobs which I assume had been triggered

geekotest wrote 2021-09-30 03:13:09 +0000
Investigate retry job: https://openqa.suse.de/t7268550 failed, likely not a sporadic test

OPENQA_INVESTIGATE_ORIGIN   https://openqa.suse.de/t7261950

Use of uninitialized value $yast_pid in scalar chomp at sle/lib/y2_base.pm line 45.
    y2_base::save_strace_gdb_output(skip_install_addons=HASH(0x1002f4a5c50)) called at sle/lib/y2_installbase.pm line 558
    y2_installbase::post_fail_hook(skip_install_addons=HASH(0x1002f4a5c50)) called at /usr/lib/os-autoinst/basetest.pm line 331

Expected result

There should be a comment referencing up to four investigation jobs triggered

Suggestions

  • Look into Apache logs e.g. errors posting a comments
    • grep 7261950 access_log | grep openqa-investigate
    • zgrep 7261950 access_log-20210930.xz | grep openqa-investigate
  • Retry if comment can't be posted
Actions #1

Updated by okurz over 2 years ago

  • Subject changed from Job hooks trigger investigate jobs for passed/soft-failed size:M to Investigation jobs triggered but missing comment on original job
  • Priority changed from Normal to High

treating as "High" as there is only a single static job URL so far and we should act before loosing the reference.

Actions #2

Updated by livdywan over 2 years ago

  • Subject changed from Investigation jobs triggered but missing comment on original job to Investigation jobs triggered but missing comment on original job size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by tinita over 2 years ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita
Actions #4

Updated by tinita over 2 years ago

I did some database stats on osd and o3 which suggest that this happens quite often.

o3:

openqa=> select j.id, js.value from job_settings js join jobs j on js.job_id=j.id where key='OPENQA_INVESTIGATE_ORIGIN' and js.t_created >= '2021-09-30 08:00:00' and test like '%:investigate:retry' order by value;                                 
   id    |                value                
---------+-------------------------------------
 1947052 | http://openqa.opensuse.org/t1945661 <--
 1946990 | http://openqa.opensuse.org/t1946900
 1946998 | http://openqa.opensuse.org/t1946902
 1946969 | http://openqa.opensuse.org/t1946913 <--
 1947002 | http://openqa.opensuse.org/t1946994
 1947006 | http://openqa.opensuse.org/t1946995
 1947010 | http://openqa.opensuse.org/t1946996
 1947014 | http://openqa.opensuse.org/t1946997
 1947079 | http://openqa.opensuse.org/t1947049
 1947054 | http://openqa.opensuse.org/t1947050
 1947098 | http://openqa.opensuse.org/t1947059
(11 rows)

openqa=> select job_id from comments where t_created >= '2021-09-30 08:00:00' and text like 'Automatic investigation jobs%'
 and text like '%:investigate:retry%' order by job_id;
 job_id  
---------
 1946900
 1946902
 1946994
 1946995
 1946996
 1946997
 1947049
 1947050
 1947059
(9 rows)
Actions #5

Updated by tinita over 2 years ago

I needed to exclude the cloned jobs.
This is the query I came up with to list jobs which are retries and where the original job didn't get a comment:

select CONCAT('https://openqa.suse.de/t',js.job_id),CAST(substr(js.value, 25) as int) as jobid, c.job_id, substr(c.text, 0, 30) as comment_text from job_settings js join jobs j on js.job_id=j.id left join jobs jc on j.id=jc.clone_id left j
oin comments c on CAST(substr(js.value, 25) as int) =c.job_id where key='OPENQA_INVESTIGATE_ORIGIN' and js.t_created >= '2021-09-30 00:00:00' and j.test like '%:investigate:retry' and jc.clone_id is null and c.job_id is null order by jobid;

Note that you need to adjust the substr for o3 because of the different url.

I picked one of those jobs in osd and ran openqa-investigate on it, and it posted a comment successfully. So it's not very likely that there is a problem with the logic of the code.

Actions #6

Updated by okurz over 2 years ago

tinita and me continued to investigate a bit further together. Interesting findings:

  • There are logfiles /var/log/openqa_gru* which seem to contain the stdout of the gru minion service but neither stderr nor stdout of actual minion job processes. The stderr goes into the journal of the systemd service, see journalctl -u openqa-gru, the stdout of minion job processes is put into the "notes" of minion jobs which are visible in the "minion_jobs" database table as well as over the minion dashboard UI
  • There are warnings in the gru system journal about CloneJob.pm. Proposed fix in https://github.com/os-autoinst/scripts/pull/115
  • In the case of success we found that minion jobs record the output of openqa-cli with an openQA comment id that is created, e.g. using SQL select * from minion_jobs where args::TEXT like '%7280737%'; there is the row content "hook_result": {"id": 391557}, in case of failure the "hook_result" is empty but there is no non-zero exit code recorded or anything

We have created https://github.com/os-autoinst/scripts/pull/114 for verbose output based on the "verbose" option and with https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/597 we have set that option for all hook scripts on OSD.

Now we can monitor the journal of openqa-gru again and see what should appear in lines matching something like openqa-investigate.*out:

What I can see already is:

Sep 30 20:37:19 openqa openqa-gru[821]: 404 Not Found
Sep 30 20:40:11 openqa openqa-gru[821]: 404 Not Found
Sep 30 20:40:32 openqa openqa-gru[821]: 404 Not Found
Sep 30 20:41:04 openqa openqa-gru[821]: 404 Not Found

I think this comes from the actual openqa-cli command that tries to write a comment on jobs and fails.

EDIT: When calling a command openqa-cli api --osd -X POST jobs/1/comments text="test by okurz" one gets

404 Not Found
{"error":"Job 1 does not exist","error_status":404}

with the first line being stderr and the second stdout. Now I wondered, if we see the line "404 Not Found" in the system journal of openqa-gru I should be able to search for "does not exist" in minion_jobs notes with select * from minion_jobs where notes::TEXT like '%does not exist%' order by started desc limit 1; and I found

https://openqa.suse.de/minion/jobs?id=3060251

which has as "args": "7274792", so corresponding to the openQA job https://openqa.suse.de/tests/7274792 , which by chance, is failed, has no comment, so "needs investigation". So I assume openqa-investigate was triggered for that.
But in notes we have

{"error":"Job 7274554 does not exist","error_status":404}

Ok, in this example the triggering job is a ":retry" which had a chance to run only some hours after the original job and in this case a user actually deleted the original job. That should be fine.

Actions #7

Updated by tinita over 2 years ago

PR for saving the return code of the hook command: https://github.com/os-autoinst/openQA/pull/4256
It's not obvious from the minion dashboard if a command failed or not.

Actions #8

Updated by openqa_review over 2 years ago

  • Due date set to 2021-10-15

Setting due date based on mean cycle time of SUSE QE Tools

Actions #9

Updated by tinita over 2 years ago

I added some debugging code which echoes something to stderr right before the first clone, and before checking $out before posting the comment.

For today I found two jobs on osd without comments, 7285740 and 7270784.

In the openqa-gru journal I can see the first debug entry, but not the second, meaning that the script finished before (killed by timeout for example) or simply did not get to this code path.

Actions #10

Updated by tinita over 2 years ago

PR https://github.com/os-autoinst/openQA/pull/4256 (log exit code in hook cmd) was merged.

With that we will hopefully gather more information, for example a timeout would result in an exit code of 124.

Actions #11

Updated by tinita over 2 years ago

We can now see which hooks failed:

select id, args, notes->'hook_rc', notes->'hook_result', created, finished from minion_jobs where notes::TEXT like '%hook_rc%' and cast(notes->'hook_rc' as int)  != 0 order by created;

There are several with rc=31744 which is 31744 << 8 ==124, so it got a timeout.
That matches the
createdandfinished` columns, as the timeout is currently set to 90s.

To get something in stderr, we should catch the TERM signal with trap.
But IMHO we should also increase the timeout, and we need to create an alert of some kind when it still fails. Currently those failures are ignored

Actions #12

Updated by tinita over 2 years ago

https://github.com/os-autoinst/openQA/pull/4274 Run hook script timeout in verbose mode

Actions #14

Updated by tinita over 2 years ago

Short statistics about finalize_job_results minion jobs:

openqa=> select count(*) from minion_jobs where task = 'finalize_job_results' and  cast(notes->'hook_rc' as int)  = 0 ;
 count 
-------
  1532
(1 row)

openqa=> select count(*) from minion_jobs where task = 'finalize_job_results' and  cast(notes->'hook_rc' as int)  != 0 ;
 count 
-------
    51
(1 row)
openqa=> select avg(finished - started) from minion_jobs where task = 'finalize_job_results' and  cast(notes->'hook_rc' as int)  = 0 ;
       avg       
-----------------
 00:00:19.716963
(1 row)

So the large majority of the hooks is fast enough.

The slowest job that was killed due to timeout took 8:29 minutes, and I'm actually wondering why it wasn't killed earlier as we have a timeout of 90s. Maybe my calculation of finished - started is wrong.

Setting the timeout to 3 minutes should catch almost all occurrences.

Actions #16

Updated by tinita over 2 years ago

  • Status changed from In Progress to Feedback

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/601 was merged.
Today we have 9 hooks with non-zero exit, 7 of them timeouts, two of them exit code 1 (and 897 successful so far).
I cannot see related error message in the gru journal if I look for the timestamps of the failing hooks.

I found 7 jobs with missing comments in the original, however, the SQL is not good enough, and 4 of them are actually jobs that already have been deleted.
Not sure how I could improve this SQL to only get rows where the original job still exists:

select CONCAT('https://openqa.suse.de/t',js.job_id),CAST(substr(js.value, 25) as int) as jobid, c.id, c.job_id, substr(c.text, 0, 30) as comment_text from job_settings js join jobs j on js.job_id=j.id left join jobs jc on j.id=jc.clone_id left join comments c on CAST(substr(js.value, 25) as int) =c.job_id where key='OPENQA_INVESTIGATE_ORIGIN' and js.t_created >= '2021-10-08 00:00:00' and j.test like '%:investigate:retry' and jc.clone_id is null an
d c.job_id is null order by jobid;

We could increase the timeout even more, maybe 5 minutes?
See the following SQL to get a feeling for how long the hooks take:

select id, args, task, notes->'hook_rc', started, finished, finished-started  from minion_jobs where started >= '2021-10-08 00:00:00' and cast(notes->'hook_rc' as int)  = 0 order by finished-started
Actions #17

Updated by tinita over 2 years ago

Increase the timeout to 10 minutes: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/604

Yesterday we still had 42 timeouts, today 16 so far.

edit 2021-10-13: MR was merged

Actions #18

Updated by mkittler over 2 years ago

Note that I set the timeout (so low) because finalize jobs piled up on OSD. I hope this won't be the case again.

Actions #19

Updated by tinita over 2 years ago

mkittler wrote:

Note that I set the timeout (so low) because finalize jobs piled up on OSD. I hope this won't be the case again.

Yeah, but we have an individual timeout for grep, which was the problematic command.

Actions #20

Updated by tinita over 2 years ago

  • Status changed from Feedback to Resolved

No timeouts since yesterday 10am, and also no jobs with missing comments since then.

openqa=> select id, args, notes->'hook_rc', finished-started  from minion_jobs where task='finalize_job_results' and started >= '2021-10-13 00:00:00' and cast(notes->'hook_rc' as int)  = 0 order by finished-started;

...
 3175529 | [7396900, null] | 0        | 00:04:04.862248
 3173802 | [7397004, null] | 0        | 00:04:06.029641
 3186645 | [7409020, null] | 0        | 00:04:18.535297
 3173800 | [7397181, null] | 0        | 00:04:37.506425
 3187689 | [7410617, null] | 0        | 00:04:40.543086
 3181702 | [7404522, null] | 0        | 00:05:09.212596
 3186968 | [7409170, null] | 0        | 00:05:10.293948
 3175831 | [7397221, null] | 0        | 00:05:14.800126
 3171255 | [7371796, null] | 0        | 00:06:05.722159
 3186500 | [7408450, null] | 0        | 00:08:16.458306
 3186462 | [7408467, null] | 0        | 00:09:01.491331
(1247 rows)


openqa=> select avg(finished - started) from minion_jobs where task = 'finalize_job_results' and  cast(notes->'hook_rc' as int)  = 0 ;
       avg       
-----------------
 00:00:37.170541

Actions

Also available in: Atom PDF