action #164898

open

coordination #58184: [saga][epic][use case] full version control awareness within openQA

coordination #152847: [epic] version control awareness within openQA for test distributions

Replace fetchneedles with a minion job size:M

Added by tinita 2 months ago. Updated about 11 hours ago.

Status: Feedback
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date:
Due date: 2024-10-11 (Due in 1 day)
% Done: 0%
Estimated time:

Description

Motivation

See #162125 for the spike solution.

fetchneedles is a script provided within the openQA repo. We call it on o3+osd in a cron job every minute to keep test distribution checkouts updated, but it's not well documented, can interfere with openQA's internal git handling and (probably) still needs an initial checkout of test distributions.

Acceptance criteria

  • AC1: Instead of the fetchneedles cronjob, test/needle repos are updated via a minion job when tests are started
  • AC2: If necessary, also call that minion job regularly

Suggestions

Out of scope

  • Doing any kind of initial checkout if git working copies do not exist yet

Related issues 11 (4 open, 7 closed)

  • Related to openQA Project - action #164889: Ensure git repos cloned by minions are cleaned up regularly size:S (Resolved)
  • Related to openQA Project - action #164886: Use OpenQA::Git for all our git wrappers size:S (Resolved, robert.richardson)
  • Related to openQA Project - action #164883: Use same minion guard for save_needle, delete_needles and git_clone size:S (Resolved, tinita)
  • Related to openQA Infrastructure - action #164895: o3 had corrupted needles git repo, lost uncommitted needles between 2024-07-31 and 2024-08-02 (Resolved, tinita, 2024-08-02)
  • Related to openQA Project - action #165066: Ensure local changes to git repos cloned by git_auto_clone are left alone size:S (Resolved, dheidler, 2024-08-08)
  • Related to openQA Infrastructure - action #166721: [alert] Waves of emails due to kex_exchange_identification: Connection closed by remote host errors (Blocked, livdywan)
  • Related to openQA Project - action #156922: Run os-autoinst-distri-openQA directly from git without anything related in o3:/var/lib/openqa/share/tests (Blocked, okurz)
  • Related to openQA Project - action #167635: Needle Admin Interface shows wrong timestamps; t/ui/21-admin-needles.t can fail locally depending on time zone size:S (Workable, 2024-09-30)
  • Related to openQA Project - action #168013: Only make one api call in openqa-advanced-retrigger-jobs (New, 2024-10-09)
  • Copied from openQA Project - action #162125: [timeboxed:10h][spike] Let openQA keep test distribution checkouts up to date without needing fetchneedles size:S (Resolved, tinita, 2024-06-12)
  • Copied to openQA Project - action #167386: Handle too many warnings "Local checkout at … but requesting to clone from" size:S (Resolved, dheidler, 2024-09-25)

Actions
Actions #1

Updated by tinita 2 months ago

  • Copied from action #162125: [timeboxed:10h][spike] Let openQA keep test distribution checkouts up to date without needing fetchneedles size:S added
Actions #2

Updated by tinita 2 months ago

  • Related to action #164889: Ensure git repos cloned by minions are cleaned up regularly size:S added
Actions #3

Updated by tinita 2 months ago

  • Related to action #164886: Use OpenQA::Git for all our git wrappers size:S added
Actions #4

Updated by tinita 2 months ago

  • Related to action #164883: Use same minion guard for save_needle, delete_needles and git_clone size:S added
Actions #5

Updated by tinita 2 months ago

  • Related to action #164895: o3 had corrupted needles git repo, lost uncommitted needles between 2024-07-31 and 2024-08-02 added
Actions #6

Updated by tinita 2 months ago

  • Description updated (diff)
Actions #7

Updated by tinita about 2 months ago

  • Description updated (diff)
  • Status changed from New to Blocked
  • Assignee set to tinita

Blocking on the mentioned related tickets

Actions #8

Updated by tinita about 1 month ago

  • Related to action #165066: Ensure local changes to git repos cloned by git_auto_clone are left alone size:S added
Actions #9

Updated by tinita about 1 month ago

  • Status changed from Blocked to New
  • Priority changed from Normal to High

This should be done soon.
We just had a problem on o3, because os-autoinst-distri-example was scheduled, resulting in /var/lib/openqa/share/tests/example having a clone, but the default remote branch is main. fetchneedles cannot deal with that. The branch name can be configured, but it has to be the same for all repositories.
I deleted /var/lib/openqa/share/tests/example now.
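The branch problem described above comes from assuming one branch name for every checkout. A minimal sketch of the alternative, asking each checkout for its own remote default branch, is shown here. It is an illustration only and not taken from openQA or fetchneedles; the function names are hypothetical, and it assumes `refs/remotes/origin/HEAD` is set in the checkout, which is usually the case after a regular `git clone`.

```python
import subprocess


def parse_default_branch(symbolic_ref_output: str, remote: str = "origin") -> str:
    """Turn the output 'origin/main' into the branch name 'main'."""
    ref = symbolic_ref_output.strip()
    prefix = remote + "/"
    if not ref.startswith(prefix):
        raise ValueError(f"unexpected ref {ref!r} for remote {remote!r}")
    return ref[len(prefix):]


def default_branch(checkout: str, remote: str = "origin") -> str:
    """Ask git which branch the remote HEAD points to in one checkout."""
    out = subprocess.run(
        ["git", "-C", checkout, "symbolic-ref", "--short",
         f"refs/remotes/{remote}/HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_default_branch(out, remote)
```

With this, a checkout whose remote uses main and one whose remote uses master can be updated by the same loop, without a single global branch setting.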

Related tickets:

Actions #10

Updated by tinita about 1 month ago

  • Status changed from New to In Progress
Actions #11

Updated by tinita about 1 month ago

https://github.com/os-autoinst/openQA/pull/5909 Move some tests out of 14-grutasks.t

Actions #12

Updated by openqa_review about 1 month ago

  • Due date set to 2024-09-18

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by tinita about 1 month ago

Draft: https://github.com/os-autoinst/openQA/pull/5910

Todo: schedule an update regularly, independent from a running test

Open questions: Should this be enabled with the same git_auto_clone feature which handles CASEDIR/NEEDLES_DIR repos, or should it be a feature that needs to be enabled additionally?

Actions #14

Updated by livdywan about 1 month ago

  • Subject changed from Replace fetchneedles with a minion job to Replace fetchneedles with a minion job size:M
  • Description updated (diff)
Actions #15

Updated by tinita about 1 month ago

  • Status changed from In Progress to Workable

due to vacation

Actions #16

Updated by livdywan 28 days ago

  • Related to action #166721: [alert] Waves of emails due to kex_exchange_identification: Connection closed by remote host errors added
Actions #17

Updated by tinita 28 days ago

  • Status changed from Workable to In Progress
Actions #18

Updated by tinita 26 days ago

https://github.com/os-autoinst/openQA/pull/5910 Automatically update git for jobs without CASEDIR/NEEDLES_DIR

Actions #19

Updated by okurz 24 days ago

  • Status changed from In Progress to Workable
Actions #20

Updated by livdywan 23 days ago

  • Due date deleted (2024-09-18)
Actions #21

Updated by okurz 21 days ago

  • Related to action #156922: Run os-autoinst-distri-openQA directly from git without anything related in o3:/var/lib/openqa/share/tests added
Actions #22

Updated by tinita 17 days ago

  • Status changed from Workable to Feedback
Actions #23

Updated by livdywan 16 days ago

tinita wrote in #note-18:

https://github.com/os-autoinst/openQA/pull/5910 Automatically update git for jobs without CASEDIR/NEEDLES_DIR

Merged.

  • Can /etc/cron.d/openqa-update-git which calls fetchneedles now be removed on o3?
  • Can etc/master/cron.d/SLES.CRON also be removed accordingly to stop calling fetchneedles on osd?
  • Anything else needed to fulfill AC1?
Actions #24

Updated by tinita 16 days ago

My plan is:

  • Wait for deployment on o3 (DONE)
  • Enable the feature on o3 and comment out the fetchneedles cronjob
  • Monitor for a while

If things work out, improve documentation and enable it on osd as well.

Since it has been deployed on o3, I will enable it now and closely monitor.

Actions #26

Updated by okurz 15 days ago

Please handle the symptoms of incomplete GRU git clone related jobs from yesterday and today.

Actions #27

Updated by okurz 15 days ago

  • Copied to action #167386: Handle too many warnings "Local checkout at … but requesting to clone from" size:S added
Actions #28

Updated by tinita 14 days ago

Yesterday I restarted all incomplete jobs starting 2024-09-24 16:00 UTC

Actions #29

Updated by tinita 14 days ago

I enabled the feature again and scheduled an openQA build:
https://openqa.opensuse.org/tests/overview?version=Tumbleweed&distri=openqa&build=Build%3ATW.31611-tinita
It's failing because there is no image, but the git_clone minion job passed:
https://openqa.opensuse.org/minion/jobs?id=4355020

Now I will monitor for other scheduled products and possible incompletes.

Actions #30

Updated by okurz 14 days ago

  • Status changed from Feedback to In Progress
Actions #31

Updated by tinita 14 days ago · Edited

There is some problem with the hourly timer.
It's retrying every 30something seconds, without an error message, so we assume it's having problems acquiring minion guards.
Here is the minion job:
https://openqa.opensuse.org/minion/jobs?id=4355236
I deleted it now and copied the YAML here:
Minion job

Actions #32

Updated by tinita 13 days ago

https://github.com/os-autoinst/openQA/pull/5951 for making the systemd script work.

next step: also enqueue git_clone for job restarts

Actions #33

Updated by openqa_review 13 days ago

  • Due date set to 2024-10-11

Setting due date based on mean cycle time of SUSE QE Tools

Actions #34

Updated by tinita 13 days ago

Updated https://github.com/os-autoinst/openQA/pull/5951

I had a closer look at OpenQA::Task::Needle::Save now and will change it so that it creates a limit_needle_task and a git_clone_${needledir}_task guard, so that the git_clone task only needs the git_clone guard for every path, but not the limit_needle_task guard.
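The guard split described above can be pictured with a small model. This is a sketch under stated assumptions: `GuardRegistry`, `acquire_all` and the two helper functions are hypothetical and only imitate named, expiring locks in memory; they are not the Minion API. Only the guard names `limit_needle_task` and `git_clone_${needledir}_task` come from the comment.

```python
import time


class GuardRegistry:
    """Toy stand-in for named, expiring locks (not the Minion API)."""

    def __init__(self):
        self._held = {}  # guard name -> expiry timestamp

    def acquire(self, name, ttl, now=None):
        now = time.monotonic() if now is None else now
        expiry = self._held.get(name)
        if expiry is not None and expiry > now:
            return False  # somebody else still holds this guard
        self._held[name] = now + ttl
        return True

    def release(self, name):
        self._held.pop(name, None)


def guards_for_save_needle(needledir):
    # Saving a needle needs the global needle limit and the per-path git guard.
    return ["limit_needle_task", f"git_clone_{needledir}_task"]


def guards_for_git_clone(paths):
    # git_clone only needs one per-path guard for every path it touches.
    return [f"git_clone_{path}_task" for path in paths]


def acquire_all(registry, names, ttl=60, now=None):
    """Take all guards or none; release partial acquisitions on failure."""
    taken = []
    for name in names:
        if registry.acquire(name, ttl, now):
            taken.append(name)
        else:
            for held in taken:
                registry.release(held)
            return False
    return True
```

In this model a git_clone job is only blocked by a needle save working on the same directory, not by needle saves in general.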

Actions #35

Updated by tinita 10 days ago

https://github.com/os-autoinst/openQA/pull/5961 Improve minion guards for needle tasks

Actions #36

Updated by tinita 10 days ago

Apparently the delete_needles task is also doing git operations, just hidden in the result class.

While trying to change the guard for that as well, I noticed that t/ui/21-admin-needles.t is failing for me locally.

ok 6 - last use is right
not ok 7 - last match is right

#   Failed test 'last match is right'
#   at t/ui/21-admin-needles.t line 76.
#          got: 'about 12 hours ago'
#     expected: 'about 14 hours ago'

I did further checks: Going to the needle admin interface on o3, osd and in my local instance I noticed that every reported time of last seen or last match is two hours older than it should be, so it's a timezone issue.
The server returns a timestamp without an offset to the client.

Currently writing a fix for this.

I need this test to pass so I can test my actual feature.
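The effect of a timestamp without an offset can be reproduced in isolation. The following sketch is not openQA code; the date, the UTC+2 client time zone and the helper `parse_on_client` are assumptions chosen to mirror the two-hour difference described above.

```python
from datetime import datetime, timedelta, timezone

# The moment the server actually means, in UTC.
server_time = datetime(2024, 9, 30, 12, 0, 0, tzinfo=timezone.utc)

# What goes over the wire: once without and once with an offset.
without_offset = server_time.replace(tzinfo=None).isoformat()
with_offset = server_time.isoformat()

# Assumed client time zone: two hours ahead of UTC.
client_tz = timezone(timedelta(hours=2))


def parse_on_client(text, local_tz):
    """Parse a timestamp; treat a missing offset as local time."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=local_tz)
    return parsed


# How much older the event looks to the client than it really is.
skew = server_time - parse_on_client(without_offset, client_tz)
```

Here `skew` comes out as two hours, while the value carrying an explicit offset parses back to the original instant regardless of the client's time zone.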

Actions #37

Updated by livdywan 9 days ago

  • Related to action #167635: Needle Admin Interface shows wrong timestamps; t/ui/21-admin-needles.t can fail locally depending on time zone size:S added
Actions #38

Updated by tinita 8 days ago

While adding code for the new minion guards in OpenQA::Task::Needle::Delete I saw that we have insufficient tests in that area, so I added tests first:
https://github.com/os-autoinst/openQA/pull/5969

Actions #39

Updated by tinita 7 days ago

https://github.com/os-autoinst/openQA/pull/5961 ready for review

next step (again): also enqueue git_clone for job restarts

Actions #40

Updated by tinita 1 day ago

Ready: https://github.com/os-autoinst/openQA/pull/5953 Trigger git_clone also for Job restart

Actions #41

Updated by tinita about 18 hours ago · Edited

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/5953 merged.

I just enabled the git_auto_update feature on o3 and disabled fetchneedles in /etc/cron.d/openqa-update-git.

Looks good so far, several git_clone jobs were triggered for classic CASEDIR jobs since then: https://openqa.opensuse.org/minion/jobs?task=git_clone&state=&queue=&note=

Actions #42

Updated by tinita about 17 hours ago · Edited

I enabled and started openqa-enqueue-git-auto-update.timer but saw another problem again:
https://openqa.opensuse.org/minion/jobs?id=4405061

---
args:
- /var/lib/openqa/share/tests/example: ~
  /var/lib/openqa/share/tests/obs: ~
  /var/lib/openqa/share/tests/openqa: ~
  /var/lib/openqa/share/tests/openqa/products/openqa/needles: ~
  /var/lib/openqa/share/tests/opensuse: ~
  /var/lib/openqa/share/tests/opensuse/products/alp/needles: ~
  /var/lib/openqa/share/tests/opensuse/products/kubic/needles: ~
  /var/lib/openqa/share/tests/opensuse/products/microos/needles: ~
  /var/lib/openqa/share/tests/opensuse/products/opensuse/needles: ~
  /var/lib/openqa/share/tests/opensuse/products/sle-micro/needles: ~
attempts: 1
children: []
created: 2024-10-09T10:36:06.786822Z
delayed: 2024-10-09T10:52:04.955126Z
expires: ~
finished: ~
id: 4405061
lax: 0
notes:
  gru_id: 20646663
parents: []
priority: 10
queue: default
result: ~
retried: 2024-10-09T10:51:26.955126Z
retries: 26
started: 2024-10-09T10:51:26.906912Z
state: inactive
task: git_clone
time: 2024-10-09T10:52:03.132277Z
worker: 2644
Could not get guard for git_clone_/var/lib/openqa/share/tests/opensuse/products/opensuse/needles_task, retrying in 31s

The needle directories also need to be skipped if they are symlinks. I only checked the product dirs.

https://github.com/os-autoinst/openQA/pull/5991 Skip all symlinks for git_auto_update service
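The symlink check mentioned above can be sketched as a simple filter that is applied to every candidate path, product directories and needle directories alike. This is an illustration, not the code from the pull request; `update_candidates` and the directory names in the demo are hypothetical.

```python
import os
import tempfile


def update_candidates(paths):
    """Keep only real directories; drop anything that is a symlink."""
    return [p for p in paths if os.path.isdir(p) and not os.path.islink(p)]


# Small demo with a real directory and a symlink pointing at it.
with tempfile.TemporaryDirectory() as root:
    real = os.path.join(root, "opensuse")
    link = os.path.join(root, "alias")
    os.mkdir(real)
    os.symlink(real, link)
    kept = update_candidates([real, link])
    kept_names = [os.path.basename(p) for p in kept]
```

Note that `os.path.isdir` follows symlinks, so the explicit `os.path.islink` test is what excludes the linked directory.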

Disabling the timer again for now, but the feature can stay active.

Actions #43

Updated by tinita about 14 hours ago

Merged and deployed on o3.

systemctl enable openqa-enqueue-git-auto-update.timer
systemctl start openqa-enqueue-git-auto-update.timer

It started immediately and finished very fast. According to the log file everything is ok:
https://openqa.opensuse.org/minion/jobs?id=4405459

Actions #44

Updated by tinita about 11 hours ago

One observation:
There can be cases when a lot of jobs are created. One case is openqa-investigate.
I can see many git_clone jobs created at the same time and see many investigate jobs in the scheduled table.
In that case the git_clone jobs are waiting for each other and often doing the same thing (although sometimes with different refs of course).
(This was also the case before this ticket as in that case we have an explicit CASEDIR.)

I think it might be better suited for the scheduler instead of the individual job creation/restart/schedule product events.

The scheduler regularly looks for new jobs. There could be an additional step where it collects all CASEDIR/NEEDLES_DIR settings from new jobs and creates one minion job per CASEDIR/NEEDLES_DIR.
But maybe this would require a new status.
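The collection step proposed above amounts to grouping new jobs by their CASEDIR/NEEDLES_DIR values, so that each distinct value leads to one update instead of one per job. A minimal sketch, assuming jobs are plain dictionaries with an `id` and a `settings` mapping; this data shape, the function name and the example URLs are hypothetical and do not reflect the openQA scheduler.

```python
from collections import defaultdict


def group_clone_requests(new_jobs):
    """Map each distinct CASEDIR/NEEDLES_DIR value to the jobs that need it."""
    wanted = defaultdict(list)
    for job in new_jobs:
        for key in ("CASEDIR", "NEEDLES_DIR"):
            url = job["settings"].get(key)
            if url:
                wanted[url].append(job["id"])
    return dict(wanted)


# Three investigate-style jobs, two of them sharing the same ref.
jobs = [
    {"id": 1, "settings": {"CASEDIR": "https://example.com/tests.git#abc"}},
    {"id": 2, "settings": {"CASEDIR": "https://example.com/tests.git#abc"}},
    {"id": 3, "settings": {"CASEDIR": "https://example.com/tests.git#def"}},
]
requests = group_clone_requests(jobs)
```

In this example three jobs result in two update requests, because values that differ only in their ref are still treated as distinct.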

Actions #45

Updated by tinita about 11 hours ago

  • Related to action #168013: Only make one api call in openqa-advanced-retrigger-jobs added