Project

General

Profile

Actions

action #95830

closed

Errors from hook scripts in openqa-gru service log ("parse error: Invalid numeric literal at...", JSON error) size:M

Added by tinita about 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Regressions/Crashes
Target version:
Start date:
2021-06-24
Due date:
2021-08-06
% Done:

0%

Estimated time:

Description

Observation

journalctl -u openqa-gru on osd shows:

-- Logs begin at Sun 2021-07-18 03:30:00 CEST, end at Tue 2021-07-20 16:23:49 CEST. --
Jul 20 12:59:01 openqa openqa-gru[22516]: parse error: Invalid numeric literal at line 1, column 10
Jul 20 14:31:27 openqa openqa-gru[22516]: unable to clone job 6490922: it is part of a parallel or directly chained cluster (not supported)
Jul 20 14:31:29 openqa openqa-gru[22516]: unable to clone job 6490922: it is part of a parallel or directly chained cluster (not supported)
Jul 20 14:31:30 openqa openqa-gru[22516]: unable to clone job 6400555: it is part of a parallel or directly chained cluster (not supported)
Jul 20 14:31:31 openqa openqa-gru[22516]: unable to clone job 6400555: it is part of a parallel or directly chained cluster (not supported)
Jul 20 15:40:43 openqa openqa-gru[22516]: parse error: Invalid numeric literal at line 1, column 10
Jul 20 15:40:43 openqa openqa-gru[22516]: /opt/os-autoinst-scripts/openqa-investigate: line 73: echo: write error: Broken pipe

Acceptance criteria

  • AC1: No obvious ther error messages ("parse error: Invalid numeric literal at...") in openqa-gru logs

Suggestions

  • Try to run openqa-investigate scripts manually from github.com/os-autoinst/scripts/ against jobs on osd and check for errors

Further details

openqa-gru runs multiple asynchronous background processes. The ones mentioned above are from "job_post_hooks".


Related issues 2 (0 open2 closed)

Copied from openQA Project - action #94678: Multiple errors from hook scripts in openqa-gru service log (was: Missing openqa-investigate jobs on osd) size:MResolved2021-06-242021-07-23

Actions
Copied to openQA Project - action #96122: Errors from hook scripts in openqa-gru service log ("grep: range out of order in character class") size:MResolvedtinita2021-08-12

Actions
Actions #1

Updated by tinita about 3 years ago

  • Copied from action #94678: Multiple errors from hook scripts in openqa-gru service log (was: Missing openqa-investigate jobs on osd) size:M added
Actions #2

Updated by ilausuch about 3 years ago

  • Subject changed from Errors from hook scripts in openqa-gru service log ("unable to clone job", JSON error) to Errors from hook scripts in openqa-gru service log ("parse error: Invalid numeric literal at...", JSON error) size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #3

Updated by tinita about 3 years ago

  • Status changed from Workable to In Progress
  • Assignee set to tinita
Actions #5

Updated by tinita about 3 years ago

  • Status changed from In Progress to Feedback
Actions #6

Updated by tinita about 3 years ago

PR merged, waiting for being deployed

Actions #7

Updated by okurz about 3 years ago

scripts are deployed near-continuous, maybe every minute or so. So it should be live already on both o3+osd

Actions #8

Updated by tinita about 3 years ago

Ok, so maybe printing the whole output was a bad idea.
The log now says:

Jul 22 21:01:56 openqa openqa-gru[30446]: jq (88 /opt/os-autoinst-scripts/openqa-investigate): parse error: Invalid numeric literal at line 1, column 10 (Input: >>><!DOCTYPE html>
Jul 22 21:01:56 openqa openqa-gru[30446]: <html lang="en">
Jul 22 21:01:56 openqa openqa-gru[30446]:   <head>
Jul 22 21:01:56 openqa openqa-gru[30446]:       <!-- Meta, title, CSS, favicons, etc. -->
Jul 22 21:01:56 openqa openqa-gru[30446]:       <meta charset="utf-8">
Jul 22 21:01:56 openqa openqa-gru[30446]:       <meta http-equiv="X-UA-Compatible" content="IE=edge">
Jul 22 21:01:56 openqa openqa-gru[30446]:       <meta name="viewport" content="width=device-width, initial-scale=1">
Jul 22 21:01:56 openqa openqa-gru[30446]:       <meta name="description" content="openQA is a testing framework mainly for distributions">
Jul 22 21:01:56 openqa openqa-gru[30446]:       <meta name="keywords" content="Testing, Linux, Qemu">
Jul 22 21:01:56 openqa openqa-gru[30446]:       <meta name="author" content="openQA contributors">
Jul 22 21:01:56 openqa openqa-gru[30446]:       <title>openQA: Page not found</title>
...

That's the curl output from:
vars_last_good=$(curl -s "$host_url/tests/$last_good"/file/vars.json)
So I guess we're requesting vars.json for a test which doesn't have a vars.json.

What should we do in such a case? Silently skip it?

Actions #9

Updated by okurz about 3 years ago

Hm, actually interesting case. I would say it's one of many cases when the "not found" Page that we deliver is enormous and should be much much smaller. So maybe we tweak the not found template from openQA, maybe differ if we access as human user over the webui or other ways like here?

Actions #10

Updated by tinita about 3 years ago

I would say we should edit the 404 page in any case. It can simply link to a special page that shows all the routes, and the 404 page should be kept small.

But still, I think it would be better to test for a 404 status before and output a message in this case.

Additionally the jq wrapper should probably just output the first x characters.

Actions #11

Updated by okurz about 3 years ago

yes to all three :)

tinita wrote:

I would say we should edit the 404 page in any case. It can simply link to a special page that shows all the routes, and the 404 page should be kept small.

we could kinda turn around https://openqa.opensuse.org/api_help which is small but links to https://openqa.opensuse.org/not_found to show all routes. I think I was trying to work on the 404 page ages ago to ensure we have an API help page. Maybe at that time I could not find a better solution to just link to the default mojolicious 404 page. Maybe with combined better knowledge now we could generate a dedicated "API route help page", put that on https://openqa.opensuse.org/api_help, and make the 404 tiny instead and link to api_help

Actions #12

Updated by tinita about 3 years ago

First aid to prevent filling our logs and making it hard to read: https://github.com/os-autoinst/scripts/pull/91

Actions #13

Updated by okurz about 3 years ago

  • Due date changed from 2021-07-23 to 2021-08-06

seems like you copied the due-date as well. It's a bit too tight :) So extending to be our usual 14 days in the future

Actions #14

Updated by tinita about 3 years ago

PR for better error handling: https://github.com/os-autoinst/scripts/pull/92

Actions #15

Updated by tinita about 3 years ago

  • Status changed from Feedback to In Progress

PR was merged.

We still see the error message (and others) in the log though:

Jul 27 09:22:15 openqa openqa-gru[2105]: unable to clone job 6578645: it is part of a parallel or directly chained cluster (not supported)
Jul 27 09:49:53 openqa openqa-gru[2105]: grep: range out of order in character class
Jul 27 09:49:53 openqa openqa-gru[2105]: grep: range out of order in character class
Jul 27 09:49:56 openqa openqa-gru[2105]: grep: range out of order in character class
Jul 27 09:50:04 openqa openqa-gru[2105]: parse error: Invalid numeric literal at line 1, column 10
Jul 27 09:50:09 openqa openqa-gru[2105]: parse error: Invalid numeric literal at line 1, column 10
Jul 27 09:50:10 openqa openqa-gru[2105]: parse error: Invalid numeric literal at line 1, column 10
Jul 27 09:50:10 openqa openqa-gru[2105]: grep: range out of order in character class
Jul 27 09:50:17 openqa openqa-gru[2105]: grep: range out of order in character class
Jul 27 09:50:21 openqa openqa-gru[2105]: grep: range out of order in character class
Jul 27 09:51:50 openqa openqa-gru[2105]: grep: exceeded PCRE's backtracking limit
Jul 27 09:51:50 openqa openqa-gru[2105]: grep: range out of order in character class
Jul 27 09:52:11 openqa openqa-gru[2105]: grep: exceeded PCRE's backtracking limit

Also we now see a reason why the curl command returned a 404:

Jul 27 10:04:24 openqa openqa-gru[2105]: curl: Error fetching url (-s https://openqa.suse.de/tests/null/file/vars.json): Got Status 40
Actions #16

Updated by okurz about 3 years ago

tinita wrote:

Jul 27 10:04:24 openqa openqa-gru[2105]: curl: Error fetching url (-s https://openqa.suse.de/tests/null/file/vars.json): Got Status 40

the URL part "null" is likely already coming from jq

Actions #17

Updated by tinita about 3 years ago

okurz wrote:

the URL part "null" is likely already coming from jq

That's pretty obvious, yeah

Actions #18

Updated by tinita about 3 years ago

looks like this line:
https://github.com/os-autoinst/openQA/blob/master/lib/OpenQA/Schema/Result/Jobs.pm#L2019

sub investigate {
...
    $inv{last_good} //= 'not found';

OTOH this is already checked: if [[ "$last_good_exists" = "not found" ]]; then ...

Actions #19

Updated by tinita about 3 years ago

PR for the null URL https://openqa.suse.de/tests/null/file/vars.json: https://github.com/os-autoinst/scripts/pull/94

Actions #20

Updated by tinita about 3 years ago

Regarding the jq error: Do we have a list of the scripts where the output ends up in this log? All scripts in the os-autoinst/scripts repo?

I would suggest to share the functions runcurl and runjq between them.

Actions #21

Updated by tinita about 3 years ago

PR 94 merged, waiting for more informative output in the logs.

Actions #22

Updated by tinita about 3 years ago

Jul 27 17:03:10 openqa openqa-gru[2105]: .last_good.text not found: investigation for test 6584829 returned '{"error":"No previous job in this scenario, cannot provide hints"}'
Actions #24

Updated by tinita about 3 years ago

  • Copied to action #96122: Errors from hook scripts in openqa-gru service log ("grep: range out of order in character class") size:M added
Actions #25

Updated by ilausuch about 3 years ago

During the daily we commented that we'll should wait to see several days of logs to check it

Actions #26

Updated by tinita about 3 years ago

  • Status changed from In Progress to Feedback

Another PR for the remaining jq errors: https://github.com/os-autoinst/scripts/pull/98

(although I don't see any such errors in the logs from the last days)

It also contains general improvements.

Actions #27

Updated by tinita about 3 years ago

  • Status changed from Feedback to Resolved

The PR was merged. Currently I don't see such error messages in the log.
It seems to be rare, so I guess I will resolve the ticket, and if it happens again we will either see a jq error with more information or a message that curl didn't get a 200 back.

Actions

Also available in: Atom PDF