action #164898

closed

coordination #58184: [saga][epic][use case] full version control awareness within openQA

coordination #152847: [epic] version control awareness within openQA for test distributions

Replace fetchneedles with a minion job for the regular update of git repos size:M

Added by tinita 4 months ago. Updated about 1 month ago.

Status: Resolved
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date:
Due date:
% Done: 0%
Estimated time:

Description

Motivation

See #162125 for the spike solution.

fetchneedles is a script provided within the openQA repo. On o3 and osd we call it in a cron job every minute to keep test distribution checkouts updated, but it is not well documented, can interfere with openQA's internal git handling, and (probably) still needs an initial checkout of test distributions.

Acceptance criteria

  • AC1: Instead of the fetchneedles cronjob, test/needle repos are updated via a minion job when tests are started
  • AC2: If necessary, also call that minion job regularly

Suggestions

Out of scope

  • Doing any kind of initial checkout if git working copies do not exist yet

Related issues 14 (3 open, 11 closed)

  • Related to openQA Project - action #164889: Ensure git repos cloned by minions are cleaned up regularly size:S (Resolved)
  • Related to openQA Project - action #164886: Use OpenQA::Git for all our git wrappers size:S (Resolved, robert.richardson)
  • Related to openQA Project - action #164883: Use same minion guard for save_needle, delete_needles and git_clone size:S (Resolved, tinita)
  • Related to openQA Infrastructure - action #164895: o3 had corrupted needles git repo, lost uncommitted needles between 2024-07-31 and 2024-08-02 (Resolved, tinita, 2024-08-02)
  • Related to openQA Project - action #165066: Ensure local changes to git repos cloned by git_auto_clone are left alone size:S (Resolved, dheidler, 2024-08-08)
  • Related to openQA Infrastructure - action #166721: [alert] Waves of emails due to kex_exchange_identification: Connection closed by remote host errors size:S (Feedback, livdywan)
  • Related to openQA Project - action #156922: Run os-autoinst-distri-openQA directly from git without anything related in o3:/var/lib/openqa/share/tests size:S (Workable, dheidler)
  • Related to openQA Project - action #167635: Needle Admin Interface shows wrong timestamps; t/ui/21-admin-needles.t can fail locally depending on time zone size:S (Resolved, dheidler, 2024-09-30)
  • Related to openQA Project - action #168013: Only make one api call in openqa-advanced-retrigger-jobs (Resolved, mkittler, 2024-10-09)
  • Related to openQA Project - action #124487: [openqa_logwarn] Can't call method "BUILD" on an undefined value at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Plugin/AMQP.pm (Resolved, tinita, 2023-02-14, 2024-11-12)
  • Copied from openQA Project - action #162125: [timeboxed:10h][spike] Let openQA keep test distribution checkouts up to date without needing fetchneedles size:S (Resolved, tinita, 2024-06-12)
  • Copied to openQA Project - action #167386: Handle too many warnings "Local checkout at … but requesting to clone from" size:S (Resolved, dheidler, 2024-09-25)
  • Copied to openQA Infrastructure - action #168376: Enable automatic openQA git clone instead of fetchneedles on OSD size:S (Blocked, mkittler)
  • Copied to openQA Project - action #168400: Improve locking scope of git_clone tasks size:S (Resolved, dheidler, 2024-10-17)
Actions #1

Updated by tinita 4 months ago

  • Copied from action #162125: [timeboxed:10h][spike] Let openQA keep test distribution checkouts up to date without needing fetchneedles size:S added
Actions #2

Updated by tinita 4 months ago

  • Related to action #164889: Ensure git repos cloned by minions are cleaned up regularly size:S added
Actions #3

Updated by tinita 4 months ago

  • Related to action #164886: Use OpenQA::Git for all our git wrappers size:S added
Actions #4

Updated by tinita 4 months ago

  • Related to action #164883: Use same minion guard for save_needle, delete_needles and git_clone size:S added
Actions #5

Updated by tinita 4 months ago

  • Related to action #164895: o3 had corrupted needles git repo, lost uncommitted needles between 2024-07-31 and 2024-08-02 added
Actions #6

Updated by tinita 4 months ago

  • Description updated (diff)
Actions #7

Updated by tinita 3 months ago

  • Description updated (diff)
  • Status changed from New to Blocked
  • Assignee set to tinita

Blocking on the mentioned related tickets

Actions #8

Updated by tinita 3 months ago

  • Related to action #165066: Ensure local changes to git repos cloned by git_auto_clone are left alone size:S added
Actions #9

Updated by tinita 3 months ago

  • Status changed from Blocked to New
  • Priority changed from Normal to High

This should be done soon.
We just had a problem on o3, because os-autoinst-distri-example was scheduled, resulting in /var/lib/openqa/share/tests/example having a clone, but the default remote branch is main. fetchneedles cannot deal with that. The branch name can be configured, but it has to be the same for all repositories.
I deleted /var/lib/openqa/share/tests/example now.
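
For illustration only: a minimal sketch of how the default branch could be looked up per checkout instead of being configured once for all repositories. This is not how fetchneedles or openQA does it; the helper names are hypothetical, and it assumes the checkout has a remote called origin whose refs/remotes/origin/HEAD is set (which is normally the case after a plain git clone).

```python
import subprocess


def parse_remote_head(symbolic_ref_output, remote="origin"):
    """Turn the output of `git symbolic-ref --short`, e.g. 'origin/main', into 'main'."""
    ref = symbolic_ref_output.strip()
    prefix = remote + "/"
    return ref[len(prefix):] if ref.startswith(prefix) else ref


def default_branch(repo_path, remote="origin"):
    """Ask git which branch the remote HEAD of one checkout points to (hypothetical helper)."""
    out = subprocess.run(
        ["git", "-C", repo_path, "symbolic-ref", "--short",
         "refs/remotes/%s/HEAD" % remote],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_remote_head(out, remote)
```

With something like this, a checkout whose remote default is main and one whose default is master could be updated by the same loop without a shared branch setting.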

Related tickets:

Actions #10

Updated by tinita 3 months ago

  • Status changed from New to In Progress
Actions #11

Updated by tinita 3 months ago

https://github.com/os-autoinst/openQA/pull/5909 Move some tests out of 14-grutasks.t

Actions #12

Updated by openqa_review 3 months ago

  • Due date set to 2024-09-18

Setting due date based on mean cycle time of SUSE QE Tools

Actions #13

Updated by tinita 3 months ago

Draft: https://github.com/os-autoinst/openQA/pull/5910

Todo: schedule an update regularly, independent from a running test

Open questions: Should this be enabled with the same git_auto_clone feature which handles CASEDIR/NEEDLES_DIR repos, or should it be a feature that needs to be enabled additionally?

Actions #14

Updated by livdywan 3 months ago

  • Subject changed from Replace fetchneedles with a minion job to Replace fetchneedles with a minion job size:M
  • Description updated (diff)
Actions #15

Updated by tinita 3 months ago

  • Status changed from In Progress to Workable

due to vacation

Actions #16

Updated by livdywan 3 months ago

  • Related to action #166721: [alert] Waves of emails due to kex_exchange_identification: Connection closed by remote host errors size:S added
Actions #17

Updated by tinita 3 months ago

  • Status changed from Workable to In Progress
Actions #18

Updated by tinita 3 months ago

https://github.com/os-autoinst/openQA/pull/5910 Automatically update git for jobs without CASEDIR/NEEDLES_DIR

Actions #19

Updated by okurz 2 months ago

  • Status changed from In Progress to Workable
Actions #20

Updated by livdywan 2 months ago

  • Due date deleted (2024-09-18)
Actions #21

Updated by okurz 2 months ago

  • Related to action #156922: Run os-autoinst-distri-openQA directly from git without anything related in o3:/var/lib/openqa/share/tests size:S added
Actions #22

Updated by tinita 2 months ago

  • Status changed from Workable to Feedback
Actions #23

Updated by livdywan 2 months ago

tinita wrote in #note-18:

https://github.com/os-autoinst/openQA/pull/5910 Automatically update git for jobs without CASEDIR/NEEDLES_DIR

Merged.

  • Can /etc/cron.d/openqa-update-git which calls fetchneedles now be removed on o3?
  • Can etc/master/cron.d/SLES.CRON also be removed accordingly to stop calling fetchneedles on osd?
  • Anything else needed to fulfill AC1?
Actions #24

Updated by tinita 2 months ago

My plan is:

  • Wait for deployment on o3 (DONE)
  • Enable the feature on o3 and comment out the fetchneedles cronjob
  • Monitor for a while

If things work out, improve documentation and enable it on osd as well.

Since it has been deployed on o3, I will enable it now and closely monitor.

Actions #26

Updated by okurz 2 months ago

Please handle the symptoms of incomplete GRU git clone related jobs from yesterday and today.

Actions #27

Updated by okurz 2 months ago

  • Copied to action #167386: Handle too many warnings "Local checkout at … but requesting to clone from" size:S added
Actions #28

Updated by tinita 2 months ago

Yesterday I restarted all incomplete jobs starting 2024-09-24 16:00 UTC

Actions #29

Updated by tinita 2 months ago

I enabled the feature again and scheduled an openQA build:
https://openqa.opensuse.org/tests/overview?version=Tumbleweed&distri=openqa&build=Build%3ATW.31611-tinita
It's failing because there is no image, but the git_clone minion job passed:
https://openqa.opensuse.org/minion/jobs?id=4355020

Now I will monitor for other scheduled products and possible incompletes.

Actions #30

Updated by okurz 2 months ago

  • Status changed from Feedback to In Progress
Actions #31

Updated by tinita 2 months ago · Edited

There is some problem with the hourly timer.
It's retrying every 30-something seconds without an error message, so we assume it's having problems acquiring minion guards.
Here is the minion job:
https://openqa.opensuse.org/minion/jobs?id=4355236
I deleted it now and copied the YAML here:
Minion job

Actions #32

Updated by tinita 2 months ago

https://github.com/os-autoinst/openQA/pull/5951 for making the systemd script work.

next step: also enqueue git_clone for job restarts

Actions #33

Updated by openqa_review 2 months ago

  • Due date set to 2024-10-11

Setting due date based on mean cycle time of SUSE QE Tools

Actions #34

Updated by tinita 2 months ago

Updated https://github.com/os-autoinst/openQA/pull/5951

I had a closer look at OpenQA::Task::Needle::Save now and will change it so that it creates a limit_needle_task and a git_clone_${needledir}_task guard, so that the git_clone task only needs the git_clone guard for every path, but not the limit_needle_task guard.
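
To make the intended guard split easier to follow, here is a rough sketch in Python (openQA itself is Perl, so this is only a model, not the actual change). The guard name pattern is the one visible in the minion log further down in this ticket; the function names and the conflicts helper are made up for illustration.

```python
LIMIT_NEEDLE_GUARD = "limit_needle_task"


def git_clone_guard(path):
    """Per-path guard name, following the git_clone_${path}_task pattern."""
    return "git_clone_%s_task" % path


def guards_for_save_needle(needledir):
    """Saving a needle takes the global needle guard plus the guard of its needle dir."""
    return [LIMIT_NEEDLE_GUARD, git_clone_guard(needledir)]


def guards_for_git_clone(paths):
    """A git_clone task only takes the per-path guards, never the needle guard."""
    return [git_clone_guard(p) for p in paths]


def conflicts(guards_a, guards_b):
    """Two tasks block each other only if they share at least one guard name."""
    return bool(set(guards_a) & set(guards_b))
```

Under this model a git_clone for an unrelated path no longer blocks needle saving, while a git_clone for the same needle dir still does.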

Actions #35

Updated by tinita about 2 months ago

https://github.com/os-autoinst/openQA/pull/5961 Improve minion guards for needle tasks

Actions #36

Updated by tinita about 2 months ago

Apparently the delete_needles task is also doing git operations, just hidden in the result class.

While trying to change the guard for that as well, I noticed that t/ui/21-admin-needles.t is failing for me locally.

ok 6 - last use is right                                                                                 
not ok 7 - last match is right                                                                           

#   Failed test 'last match is right'                                                                    
#   at t/ui/21-admin-needles.t line 76.                                                                  
#          got: 'about 12 hours ago'                                                                     
#     expected: 'about 14 hours ago'        

I did further checks: Going to the needle admin interface on o3, osd and in my local instance I noticed that every reported time of last seen or last match is two hours older than it should be, so it's a timezone issue.
The server returns a timestamp without an offset to the client.
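
A small Python illustration of the effect, not the openQA code: it assumes the server value is meant as UTC and that the client sits in a UTC+2 zone; the timestamp itself is made up.

```python
from datetime import datetime, timedelta, timezone

FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_as_utc(naive_text):
    """Interpret an offset-less timestamp as UTC and make that explicit."""
    return datetime.strptime(naive_text, FORMAT).replace(tzinfo=timezone.utc)


server_value = "2024-09-30T10:00:00"       # hypothetical value without an offset
client_zone = timezone(timedelta(hours=2))  # assumed client zone, UTC+2

# What goes wrong: the client treats the naive value as its own local time.
misread = datetime.strptime(server_value, FORMAT).replace(tzinfo=client_zone)
# What was meant: the same wall-clock value, but in UTC.
correct = parse_as_utc(server_value)

skew = correct - misread
print(skew)  # 2:00:00, i.e. the misread instant is two hours too old
```

Sending the offset along with the timestamp removes the guesswork on the client side.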

Currently writing a fix for this.

I need this test to pass so I can test my actual feature.

Actions #37

Updated by livdywan about 2 months ago

  • Related to action #167635: Needle Admin Interface shows wrong timestamps; t/ui/21-admin-needles.t can fail locally depending on time zone size:S added
Actions #38

Updated by tinita about 2 months ago

While adding code for the new minion guards in OpenQA::Task::Needle::Delete I saw that we have insufficient tests in that area, so I added tests first:
https://github.com/os-autoinst/openQA/pull/5969

Actions #39

Updated by tinita about 2 months ago

https://github.com/os-autoinst/openQA/pull/5961 ready for review

next step (again): also enqueue git_clone for job restarts

Actions #40

Updated by tinita about 2 months ago

Ready: https://github.com/os-autoinst/openQA/pull/5953 Trigger git_clone also for Job restart

Actions #41

Updated by tinita about 2 months ago · Edited

  • Status changed from In Progress to Feedback

https://github.com/os-autoinst/openQA/pull/5953 merged.

I just enabled the git_auto_update feature on o3 and disabled fetchneedles in /etc/cron.d/openqa-update-git.

Looks good so far, several git_clone jobs were triggered for classic CASEDIR jobs since then https://openqa.opensuse.org/minion/jobs?task=git_clone&state=&queue=&note=

Actions #42

Updated by tinita about 2 months ago · Edited

I enabled and started openqa-enqueue-git-auto-update.timer but saw another problem again:
https://openqa.opensuse.org/minion/jobs?id=4405061

---
args:
- /var/lib/openqa/share/tests/example: ~
  /var/lib/openqa/share/tests/obs: ~
  /var/lib/openqa/share/tests/openqa: ~
  /var/lib/openqa/share/tests/openqa/products/openqa/needles: ~
  /var/lib/openqa/share/tests/opensuse: ~
  /var/lib/openqa/share/tests/opensuse/products/alp/needles: ~
  /var/lib/openqa/share/tests/opensuse/products/kubic/needles: ~
  /var/lib/openqa/share/tests/opensuse/products/microos/needles: ~
  /var/lib/openqa/share/tests/opensuse/products/opensuse/needles: ~
  /var/lib/openqa/share/tests/opensuse/products/sle-micro/needles: ~
attempts: 1
children: []
created: 2024-10-09T10:36:06.786822Z
delayed: 2024-10-09T10:52:04.955126Z
expires: ~
finished: ~
id: 4405061
lax: 0
notes:
  gru_id: 20646663
parents: []
priority: 10
queue: default
result: ~
retried: 2024-10-09T10:51:26.955126Z
retries: 26
started: 2024-10-09T10:51:26.906912Z
state: inactive
task: git_clone
time: 2024-10-09T10:52:03.132277Z
worker: 2644
Could not get guard for git_clone_/var/lib/openqa/share/tests/opensuse/products/opensuse/needles_task, retrying in 31s

The needle directories also need to be skipped if they are symlinks. I only checked the product dirs.
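
As a sketch of the idea (in Python, not the actual Perl change in the pull request below): filter the candidate directories so that anything that is a symlink is left out, for product dirs and needle dirs alike. The function name and the directory names in the demo are hypothetical.

```python
import os
import tempfile


def updatable_dirs(candidates):
    """Keep only real directories; anything that is a symlink is skipped."""
    return [d for d in candidates if os.path.isdir(d) and not os.path.islink(d)]


# Tiny demo with a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    real = os.path.join(root, "opensuse")
    link = os.path.join(root, "needles-link")
    os.mkdir(real)
    os.symlink(real, link)  # symlink pointing at the real directory
    kept = updatable_dirs([real, link])
    kept_names = [os.path.basename(p) for p in kept]
```

Note that os.path.isdir follows symlinks, so the explicit os.path.islink check is what actually excludes them.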

https://github.com/os-autoinst/openQA/pull/5991 Skip all symlinks for git_auto_update service

Disabling the timer again for now, but the feature can stay active.

Actions #43

Updated by tinita about 2 months ago

Merged and deployed on o3.

systemctl enable openqa-enqueue-git-auto-update.timer
systemctl start openqa-enqueue-git-auto-update.timer

It immediately started and finished very fast, according to logfile everything ok:
https://openqa.opensuse.org/minion/jobs?id=4405459

Actions #44

Updated by tinita about 2 months ago

One observation:
There can be cases when a lot of jobs are created. One case is openqa-investigate.
I can see many git_clone jobs created at the same time and see many investigate jobs in the scheduled table.
In that case the git_clone jobs are waiting for each other and often doing the same thing (although sometimes with different refs of course).
(This was also the case before this ticket as in that case we have an explicit CASEDIR.)

I think this might be better handled in the scheduler instead of in the individual job creation/restart/schedule product events.

The scheduler regularly looks for new jobs. There could be an additional step where it collects all CASEDIR/NEEDLES_DIR settings from new jobs and creates one minion job per CASEDIR/NEEDLES_DIR.
But maybe this would require a new status.
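
Roughly what that extra scheduler step could look like, as a Python sketch under assumptions: the job structure (an id plus a settings dict) and the example URLs are invented for illustration, and nothing like this exists in openQA as far as this ticket shows.

```python
def clone_requests(new_jobs):
    """Map each distinct CASEDIR/NEEDLES_DIR value to the ids of the jobs needing it."""
    requests = {}
    for job in new_jobs:
        for key in ("CASEDIR", "NEEDLES_DIR"):
            value = job["settings"].get(key)
            if value:
                requests.setdefault(value, []).append(job["id"])
    return requests


# Hypothetical batch of newly created jobs, e.g. from openqa-investigate.
jobs = [
    {"id": 1, "settings": {"CASEDIR": "https://example.org/tests.git#abc"}},
    {"id": 2, "settings": {"CASEDIR": "https://example.org/tests.git#abc"}},
    {"id": 3, "settings": {"CASEDIR": "https://example.org/tests.git#def",
                           "NEEDLES_DIR": "https://example.org/needles.git"}},
]
requests = clone_requests(jobs)
# One minion job would then be created per key in `requests`.
```

Jobs 1 and 2 end up behind a single clone request, which is the duplication seen with openqa-investigate.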

Actions #45

Updated by tinita about 2 months ago

  • Related to action #168013: Only make one api call in openqa-advanced-retrigger-jobs added
Actions #46

Updated by tinita about 2 months ago

Not sure what to do now. We should have a discussion whether it makes sense to move the code to the scheduler.

Actions #47

Updated by tinita about 2 months ago · Edited

I disabled the feature again because people were reporting that they are getting "Another git task is ongoing. Try again later." when trying to save needles.
I checked the database:

select id, created, result from minion_jobs where result::text like '%Another git task is ongoing%' and created >= '2024-10-09' order by id limit 1000;
   id    |            created            |                           result                           
---------+-------------------------------+------------------------------------------------------------
 4407213 | 2024-10-09 17:51:44.048423+00 | {"error": "Another git task is ongoing. Try again later."}
...
 4412348 | 2024-10-10 08:24:35.936075+00 | {"error": "Another git task is ongoing. Try again later."}
 4412895 | 2024-10-10 09:06:21.612128+00 | {"error": "Another git task is ongoing. Try again later."}
(28 rows)

It happened 28 times since yesterday morning.
I will check later how often we still see this. I turned off the feature at 08:57 UTC, so there was at least one occurrence after that: https://openqa.opensuse.org/minion/jobs?id=4412895

Actions #48

Updated by tinita about 2 months ago

We discussed today that it might work to check for existing GruTasks with the same args and assign the openqa jobs to those tasks.

I created a proof of concept:
https://github.com/os-autoinst/openQA/pull/6001
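
To illustrate the idea discussed above (this is not the code from the pull request): an in-memory Python model where a new task is only created if no pending task with the same args exists, and otherwise the openQA job is attached to the existing one. The class and its fields are hypothetical; the args shape (path mapped to an empty ref) follows the minion job YAML shown earlier.

```python
import json


class PendingCloneTasks:
    """Toy registry of pending git_clone tasks, keyed by their args."""

    def __init__(self):
        self._by_args = {}  # canonical JSON of args -> task record
        self._next_id = 1

    def enqueue(self, clones, job_id):
        """Reuse a pending task with identical args, else create one; return the task id."""
        key = json.dumps(clones, sort_keys=True)
        task = self._by_args.get(key)
        if task is None:
            task = {"id": self._next_id, "args": clones, "jobs": []}
            self._next_id += 1
            self._by_args[key] = task
        task["jobs"].append(job_id)
        return task["id"]


tasks = PendingCloneTasks()
args = {"/var/lib/openqa/share/tests/opensuse": None}
first = tasks.enqueue(args, job_id=101)
second = tasks.enqueue(dict(args), job_id=102)  # same args, so the task is reused
third = tasks.enqueue({"/var/lib/openqa/share/tests/openqa": None}, job_id=103)
```

A real implementation would also have to handle tasks that finish or fail between the lookup and the assignment, which this sketch ignores.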

Actions #49

Updated by okurz about 2 months ago

  • Status changed from Feedback to In Progress
Actions #50

Updated by tinita about 2 months ago

  • Due date changed from 2024-10-11 to 2024-10-18

Unexpected challenges and necessary changes to existing code

Actions #51

Updated by tinita about 1 month ago

Actions #53

Updated by tinita about 1 month ago

I enabled the feature on o3 again and disabled the cronjob. Monitoring

Actions #54

Updated by tinita about 1 month ago

disabled again. I think I found a bug

Actions #55

Updated by tinita about 1 month ago

https://github.com/os-autoinst/openQA/pull/6015 Fix handling of job array in enqueue_git_clones

Actions #56

Updated by tinita about 1 month ago

https://github.com/os-autoinst/openQA/pull/6015 merged, enabled feature again on o3

Actions #57

Updated by okurz about 1 month ago

  • Subject changed from Replace fetchneedles with a minion job size:M to Replace fetchneedles with a minion job for the regular update of git repos size:M
Actions #58

Updated by okurz about 1 month ago

  • Copied to action #168376: Enable automatic openQA git clone instead of fetchneedles on OSD size:S added
Actions #59

Updated by tinita about 1 month ago

  • Copied to action #168400: Improve locking scope of git_clone tasks size:S added
Actions #60

Updated by tinita about 1 month ago

  • Status changed from In Progress to Resolved

Followup tickets created, resolving

Actions #61

Updated by okurz about 1 month ago

  • Due date deleted (2024-10-18)
Actions #62

Updated by tinita about 1 month ago

  • Related to action #124487: [openqa_logwarn] Can't call method "BUILD" on an undefined value at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Plugin/AMQP.pm added