action #174583

closed

openqa/scripts-ci pipeline fails, jobs ended up in various not-ok states - New openQA API route to follow job obsolescence? size:S

Added by jbaier_cz 4 months ago. Updated 3 months ago.

Status: Resolved
Priority: High
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2024-12-19
Due date:
% Done: 0%
Estimated time:

Description

Observation

The pipeline fails because the openQA jobs were obsoleted:

See: https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3562638

{"blocked_by_id":null,"id":4713396,"result":"obsoleted","state":"done"}
{"blocked_by_id":null,"id":4713397,"result":"obsoleted","state":"done"}

This happens for an unknown reason, as we don't trigger with OBSOLETE and, according to the openQA documentation, that should not be the default.

The multi-machine case looks a bit more involved, see e.g. https://gitlab.suse.de/openqa/scripts-ci/-/jobs/3625091 :

{"blocked_by_id":null,"id":16374878,"result":"skipped","state":"cancelled"}
{"blocked_by_id":null,"id":16374879,"result":"timeout_exceeded","state":"done"}
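Each status line above is one JSON object per job. A minimal Python sketch of how such lines could be parsed and checked follows. It assumes (this is not taken from the scripts-ci sources) that a job only counts as ok when its state is `done` and its result is `passed` or `softfailed`; the names `parse_status_lines`, `not_ok_jobs` and `OK_RESULTS` are hypothetical.

```python
import json

# Assumption: results treated as success; not confirmed by the scripts-ci sources.
OK_RESULTS = {"passed", "softfailed"}


def parse_status_lines(log_text):
    """Parse one JSON job-status object per line, skipping non-JSON lines."""
    jobs = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # e.g. other log output between the status lines
        jobs.append(json.loads(line))
    return jobs


def not_ok_jobs(jobs):
    """Return jobs that are not done or did not end with an ok result (hypothetical check)."""
    return [j for j in jobs if j["state"] != "done" or j["result"] not in OK_RESULTS]


log = """\
{"blocked_by_id":null,"id":16374878,"result":"skipped","state":"cancelled"}
{"blocked_by_id":null,"id":16374879,"result":"timeout_exceeded","state":"done"}
"""
for job in not_ok_jobs(parse_status_lines(log)):
    print(job["id"], job["state"], job["result"])
```

With such a strict check, both jobs of the multi-machine cluster are reported as not ok, even though only one of them may have actually misbehaved.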

Acceptance Criteria

  • AC1: Unfinished jobs don't cause failures in GitLab pipelines

Suggestions

  • Verify whether a specific worker (or set of workers) is affected and, if so, take it out of production
  • Consider restarting affected jobs
  • An "obsoleted" result should be considered expected behavior. How about a new openQA API route to follow job obsolescence? -> handled in #175299
  • Ignore "obsoleted" jobs, as the pipeline runs frequently enough anyway. Check why jobs ended up obsoleted even though scripts-ci doesn't trigger with obsoletion
  • Check whether we cancel the full parallel cluster in case a job in it is cancelled/obsoleted as we also saw jobs with parallel dependencies ending up with the result "timeout_exceeded".
    • Treat skipped/cancelled the same as obsoleted (and ignore it)
    • If the full cluster is not always cancelled, ensure this is logged
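The suggested handling can be sketched in Python as follows. This is a hypothetical policy, not what scripts-ci implements: obsoleted, skipped and cancelled jobs are ignored, and a `timeout_exceeded` result is also ignored (but logged) when another job in the same parallel cluster was ignored, since that points to the cluster having been torn down. The grouping of jobs into clusters (a list of job dicts per cluster), the result sets and the function names are assumptions.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger("scripts-ci-sketch")

# Assumptions: which results count as success and which are "not our failure".
OK_RESULTS = {"passed", "softfailed"}
IGNORED_RESULTS = {"obsoleted", "skipped"}
IGNORED_STATES = {"cancelled"}


def is_ignored(job):
    """Treat skipped/cancelled the same as obsoleted."""
    return job["result"] in IGNORED_RESULTS or job["state"] in IGNORED_STATES


def cluster_failures(cluster):
    """Return the jobs of one parallel cluster that should fail the pipeline."""
    ignored = [j for j in cluster if is_ignored(j)]
    failures = []
    for job in cluster:
        if is_ignored(job):
            logger.info("ignoring job %s (%s/%s)", job["id"], job["state"], job["result"])
        elif job["result"] in OK_RESULTS:
            continue
        elif job["result"] == "timeout_exceeded" and ignored:
            # A sibling was cancelled/obsoleted, so the timeout is likely a consequence;
            # log it so that such cases stay visible.
            logger.warning("ignoring job %s: timeout_exceeded after sibling %s ended as %s",
                           job["id"], ignored[0]["id"], ignored[0]["result"])
        else:
            failures.append(job)
    return failures


cluster = [
    {"blocked_by_id": None, "id": 16374878, "result": "skipped", "state": "cancelled"},
    {"blocked_by_id": None, "id": 16374879, "result": "timeout_exceeded", "state": "done"},
]
print([j["id"] for j in cluster_failures(cluster)])  # -> []
```

A `timeout_exceeded` job without an ignored sibling still fails the pipeline, so genuine worker problems are not hidden by this policy.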

Mitigations


Related issues 3 (1 open, 2 closed)

Related to openQA Tests (public) - action #175698: [tools][multi-machine tests] Timeout_exceeded on multiple workers including arm1, arm2 and mania (Resolved, mkittler, 2025-01-17)
Copied to openQA Project (public) - action #175299: Option to ignore obsoleted jobs when using openqa-cli monitor (New, 2024-12-19)
Copied to openQA Project (public) - action #175305: Flag to return restarted jobs when using openQA jobs API route size:S (Resolved, dheidler, 2024-12-19)
