action #111590
closed: [alert] HPC jobs not picked up for multiple days, job age alert triggered
Updated by kraih over 2 years ago
It looks like the situation has changed a bit since the ticket was opened. Now I see this job as cancelled, and the parent job as failed (with what looks like a normal test failure).
Updated by kraih over 2 years ago
- Assignee deleted (kraih)
Putting the ticket back into the queue for estimation, since the expected result here is not quite clear to me.
Updated by kraih over 2 years ago
Btw. most of the cancellations do not have a "scheduled for more than X days" reason. So they weren't triggered by the max_job_scheduled_time limit in the scheduler.
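A minimal sketch of that kind of check, assuming the cancelled jobs have already been exported as a list of dicts. The sample data, the "reason" field name and the helper name are hypothetical and only illustrate the idea; they are not taken from a real openQA instance or its API.

```python
from collections import Counter

# Hypothetical sample data: field names and values are assumptions
# made for illustration, not real openQA output.
jobs = [
    {"id": 1, "reason": None},
    {"id": 2, "reason": "scheduled for more than 7 days"},
    {"id": 3, "reason": ""},
    {"id": 4, "reason": None},
]


def split_by_scheduled_timeout(jobs, marker="scheduled for more than"):
    """Count jobs whose reason does or does not contain the marker text."""
    counts = Counter()
    for job in jobs:
        # Treat a missing or empty reason as "no reason recorded".
        reason = job.get("reason") or ""
        counts["timeout" if marker in reason else "other"] += 1
    return counts


counts = split_by_scheduled_timeout(jobs)
print(f"cancelled by scheduled-time limit: {counts['timeout']}")
print(f"cancelled for other/unknown reasons: {counts['other']}")
```

If most jobs land in the "other" bucket, the scheduled-time limit is unlikely to be what cancelled them.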
Updated by livdywan over 2 years ago
The way I read the description, this is the situation:
- I see a lot of old, cancelled jobs, i.e. 12 days old, and this is also true for the children of the clones of the parent.
- The parent failed due to network issues. This shouldn't be related to cancelled children.
- I'm assuming the jobs were manually cancelled. @okurz suggests this was automatic. Either way it means the jobs were never run.
Updated by kraih over 2 years ago
I would suggest adding cancellation reasons everywhere, to make similar issues easier to investigate.
Updated by livdywan over 2 years ago
Drafted a PR adding reasons to other cases: https://github.com/os-autoinst/openQA/pull/4681
And I see why Oli mentioned the prio. This would affect what's now called "cancelled based on job settings via API call" in my PR.
Updated by livdywan over 2 years ago
- Status changed from New to In Progress
- Assignee set to livdywan
I still want to estimate this to make sure we understand the issue, but taking it now since it's Urgent
Updated by openqa_review over 2 years ago
- Due date set to 2022-06-14
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 2 years ago
- Status changed from In Progress to Feedback
cdywan wrote:
Drafted a PR adding reasons to other cases: https://github.com/os-autoinst/openQA/pull/4681
PR got merged.
the culprit seems to be in particular "hpc" test scenarios
Down to 85 scheduled hpc scenarios right now (I don't know how to link a web UI query with the same meaning).
The number has gone down steadily without further action from my end besides monitoring it.
Updated by livdywan over 2 years ago
Unpaused. I see no reason for it to be paused currently.
Updated by livdywan over 2 years ago
- Description updated (diff)
- Status changed from Feedback to Resolved
No issues with hpc tests currently. I asked in #eng-testing about the 3 🥧️ tests that are currently stuck because the machine is offline, which is unrelated to this ticket.
Updated by okurz over 2 years ago
- Status changed from Resolved to Feedback
The alert from https://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?tab=alert&viewPanel=5&orgId=1 triggered again this morning and the graph does not look good, so a problem still persists.
Updated by livdywan over 2 years ago
- Status changed from Feedback to Resolved
okurz wrote:
The alert from https://stats.openqa-monitor.qa.suse.de/d/7W06NBWGk/job-age?tab=alert&viewPanel=5&orgId=1 triggered again this morning and the graph does not look good, so a problem still persists.
The alert was fine when I resolved the ticket. See my comment regarding the jobs that weren't getting picked up.
Updated by okurz over 2 years ago
- Related to action #111926: osd-deployment pipeline failed: test 481 -le 0, due to job age alert, likely just the raspberry pi based tests stuck in schedule added