action #94465

closed

[tools] zkvm tests are scheduled by retriggering month old jobs even though we do not have any "svirt" workers anymore

Added by okurz almost 3 years ago. Updated almost 3 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Infrastructure
Target version:
Start date: 2021-06-22
Due date: 2021-07-06
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

https://openqa.suse.de/tests/6272912 has been in state "scheduled" for multiple days because we no longer have any workers for the openQA machine selection "zkvm" with the worker class "svirt" since https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/320/diffs .

Problem

  • Who triggered these tests and why?
  • Can we prevent the erroneous retrigger of tests that cannot work?

Suggestions

  • Find out who triggered these tests and ask them directly what the intention was

Related issues 1 (0 open, 1 closed)

Related to openQA Infrastructure - action #93119: [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 (Closed, mgriessmeier, 2021-05-26)

Actions #1

Updated by okurz almost 3 years ago

  • Due date set to 2021-07-06
  • Status changed from New to Feedback

https://openqa.suse.de/admin/auditlog reveals it was mgrifalconi retriggering the tests.

I asked in https://chat.suse.de/channel/qem-openqa-review?msg=y4PeEyN6yhtsHXmQD "Michael Grifalconi hi. I saw that you triggered https://openqa.suse.de/tests/6272912 as a clone of https://openqa.suse.de/tests/6075632 . I created https://progress.opensuse.org/issues/94465 for the problem that this can not work. May I ask what was your intention? And why retrigger tests when not waiting for the result? I wonder if there is a flaw in the process if scheduling retriggers but not receiving results within a week is not showing up as a problem for you as reviewer of incident test results"

I cancelled the tests that were again scheduled for machine "zkvm". I have changed the "machine" configuration of the existing entries for s390x-kvm-sle15 to schedule on worker class "s390-kvm-sle12" for now and have updated existing jobs, see #93119#note-20

I asked mgrifalconi if I can learn from them what their original intentions were.

Actions #2

Updated by okurz almost 3 years ago

I managed to mess up the settings of quite a few jobs on osd by overwriting the worker class with "s390x-kvm-sle15". Trying to repair that. With something like

    select test,arch,machine,value from jobs,job_settings where jobs.id = job_settings.job_id and state='scheduled' and key='WORKER_CLASS' and value='s390x-kvm-sle15' limit 10;

I can look up the incorrectly configured jobs. It turns out with

    select count(test) from jobs,job_settings where jobs.id = job_settings.job_id and state='scheduled' and key='WORKER_CLASS' and value='s390x-kvm-sle15';

that it's 1090 jobs, not too bad ;) I should be able to look up the correct worker class setting from each machine setting. With

    select name,value from machines,machine_settings where machines.id = machine_settings.machine_id and key='WORKER_CLASS';

we can show the worker class of each machine. With

    select value from machines,machine_settings where machines.id = machine_settings.machine_id and key='WORKER_CLASS' and name=(select machine from jobs where id=$id);

we can get the worker class that a job $id should have.

So we should be able to update the worker class of each affected job from its machine settings:

for job in $(sudo -u geekotest psql --tuples-only --command="select job_id from jobs,job_settings where jobs.id = job_settings.job_id and state='scheduled' and key='WORKER_CLASS' and value='s390x-kvm-sle15';" openqa); do sudo -u geekotest psql --command="update job_settings set value=(select value from machines,machine_settings where machines.id = machine_settings.machine_id and key='WORKER_CLASS' and name=(select machine from jobs where id=$job)) where job_id=$job and key='WORKER_CLASS';" openqa; done

Let's see what that breaks now :D

EDIT: oorlov pointed out that I am only fixing "scheduled" jobs. Of course. So I also applied the above without the state='scheduled' filter. That takes a long time to run.
EDIT: I optimized a bit by running the complete command as "geekotest", which skips the repeated sudo calls.
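
The per-job loop above starts one psql process per job, which is likely why it runs for so long. As a rough sketch, not what was actually run here, the same repair could be expressed as a single set-based statement. The function name worker_class_fix_sql and the psql variable wrong_class are made up for illustration. Unlike the loop, the join skips jobs whose machine has no WORKER_CLASS setting instead of setting their value to NULL.

```shell
# Sketch only: print one UPDATE that copies each job's WORKER_CLASS from the
# settings of the machine the job refers to. Table and column names are taken
# from the queries in this comment; wrong_class is a psql variable that has to
# be supplied by the caller.
worker_class_fix_sql() {
    cat <<'SQL'
update job_settings js
   set value = ms.value
  from jobs j
  join machines m on m.name = j.machine
  join machine_settings ms on ms.machine_id = m.id and ms.key = 'WORKER_CLASS'
 where js.job_id = j.id
   and js.key = 'WORKER_CLASS'
   and js.value = :'wrong_class';
SQL
}
```

Possible usage, best tried on a database copy or inside a transaction first: worker_class_fix_sql | sudo -u geekotest psql -v wrong_class=s390x-kvm-sle15 openqa. The statement is piped via stdin because psql does not interpolate variables in --command strings.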

Actions #3

Updated by okurz almost 3 years ago

  • Related to action #93119: [s390x] Update of s390x Test infrastructure after shutdown of Mainframe zEC12 added

Actions #4

Updated by okurz almost 3 years ago

  • Status changed from Feedback to Resolved

Overnight the fixes were applied to the database content. Regarding the original question "Can we prevent the erroneous retrigger of tests that can not work?": I guess the best we can do is to always try to make backward-compatible changes, e.g. keep an older worker class around for a longer time.
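
As a complement to backward-compatible changes, a small pre-check before retriggering a job or removing a worker class could look roughly like the following. This is only a sketch: the function name missing_worker_classes is hypothetical, and it assumes that WORKER_CLASS is a comma-separated list and that a worker has to offer every class a job asks for, which should be verified against the openQA documentation. How the two lists are obtained is left open.

```shell
# Sketch only (POSIX sh / bash): print every class from the comma-separated
# list $1 (what the job requires) that does not appear in the comma-separated
# list $2 (what one worker offers), one class per line.
missing_worker_classes() {
    required="$1"
    offered=",$2,"          # pad with commas so only whole names match
    old_ifs="$IFS"
    IFS=','
    for class in $required; do
        case "$offered" in
            *",$class,"*) ;;                 # worker offers this class
            *) printf '%s\n' "$class" ;;     # required but not offered
        esac
    done
    IFS="$old_ifs"
}
```

If the function prints nothing, the worker offers all required classes. For example, missing_worker_classes "svirt" "s390-kvm-sle12" reports "svirt" as missing.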
