action #164898
closedcoordination #58184: [saga][epic][use case] full version control awareness within openQA
coordination #152847: [epic] version control awareness within openQA for test distributions
Replace fetchneedles with a minion job for the regular update of git repos size:M
Description
Motivation
See #162125 for the spike solution.
fetchneedles is a script provided within the openQA repo. We call it on o3 and osd in a cron job every minute to keep test distribution checkouts updated, but it is not well documented, can interfere with openQA's internal git handling and (probably) still needs an initial checkout of test distributions.
Acceptance criteria
- AC1: Instead of the fetchneedles cronjob, test/needle repos are updated via a minion job when tests are started
- AC2: If necessary, also call that minion job regularly
Suggestions
- See #162125 for the Proof of Concept: https://github.com/os-autoinst/openQA/pull/5808
- Wait for #164886, #164889, #164883
- To avoid having larger updates when no new tests were started for a longer time, consider also running the minion job regularly (like every hour)
- Add a new config value (to be bike shed)
Out of scope
- Doing any kind of initial checkout if git working copies do not exist yet
Updated by tinita 4 months ago
- Copied from action #162125: [timeboxed:10h][spike] Let openQA keep test distribution checkouts up to date without needing fetchneedles size:S added
Updated by tinita 4 months ago
- Related to action #164889: Ensure git repos cloned by minions are cleaned up regularly size:S added
Updated by tinita 4 months ago
- Related to action #164886: Use OpenQA::Git for all our git wrappers size:S added
Updated by tinita 4 months ago
- Related to action #164883: Use same minion guard for save_needle, delete_needles and git_clone size:S added
Updated by tinita 4 months ago
- Related to action #164895: o3 had corrupted needles git repo, lost uncommitted needles between 2024-07-31 and 2024-08-02 added
Updated by tinita 3 months ago
- Related to action #165066: Ensure local changes to git repos cloned by git_auto_clone are left alone size:S added
Updated by tinita 3 months ago
- Status changed from Blocked to New
- Priority changed from Normal to High
This should be done soon.
We just had a problem on o3, because os-autoinst-distri-example was scheduled, resulting in /var/lib/openqa/share/tests/example having a clone, but the default remote branch is main. fetchneedles cannot deal with that. The branch name can be configured, but it has to be the same for all repositories.
I deleted /var/lib/openqa/share/tests/example now.
Related tickets:
- #164886 looks good and can be closed I guess
- #165066 https://github.com/os-autoinst/openQA/pull/5901 is merged, likely can be closed soon
- #164889 is the only one left, and we can risk running into a merge conflict
Updated by tinita 3 months ago
https://github.com/os-autoinst/openQA/pull/5909 Move some tests out of 14-grutasks.t
Updated by openqa_review 3 months ago
- Due date set to 2024-09-18
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita 3 months ago
Draft: https://github.com/os-autoinst/openQA/pull/5910
Todo: schedule an update regularly, independent from a running test
Open questions: Should this be enabled with the same git_auto_clone feature which handles CASEDIR/NEEDLES_DIR repos, or should it be a feature that needs to be enabled additionally?
Updated by livdywan 2 months ago
- Related to action #166721: [alert] Waves of emails due to kex_exchange_identification: Connection closed by remote host errors added
Updated by tinita 2 months ago
https://github.com/os-autoinst/openQA/pull/5910 Automatically update git for jobs without CASEDIR/NEEDLES_DIR
Updated by okurz 2 months ago
- Related to action #156922: Run os-autoinst-distri-openQA directly from git without anything related in o3:/var/lib/openqa/share/tests size:S added
Updated by livdywan 2 months ago
tinita wrote in #note-18:
https://github.com/os-autoinst/openQA/pull/5910 Automatically update git for jobs without CASEDIR/NEEDLES_DIR
Merged.
- Can /etc/cron.d/openqa-update-git which calls fetchneedles now be removed on o3?
- Can etc/master/cron.d/SLES.CRON also be removed accordingly to stop calling fetchneedles on osd?
- Anything else needed to fulfill AC1?
Updated by tinita 2 months ago
My plan is:
- Wait for deployment on o3 (DONE)
- Enable the feature on o3 and comment out the fetchneedles cronjob
- Monitor for a while
If things work out, improve documentation and enable it on osd as well.
Since it has been deployed on o3, I will enable it now and closely monitor.
Updated by tinita 2 months ago
Found a typo: https://github.com/os-autoinst/openQA/pull/5945
Updated by okurz about 2 months ago
Please handle the symptoms of incomplete GRU git clone related jobs from yesterday and today.
Updated by okurz about 2 months ago
- Copied to action #167386: Handle too many warnings "Local checkout at … but requesting to clone from" size:S added
Updated by tinita about 2 months ago
Yesterday I restarted all incomplete jobs starting 2024-09-24 16:00 UTC
Updated by tinita about 2 months ago
I enabled the feature again and scheduled an openQA build:
https://openqa.opensuse.org/tests/overview?version=Tumbleweed&distri=openqa&build=Build%3ATW.31611-tinita
It's failing because there is no image, but the git_clone minion job passed:
https://openqa.opensuse.org/minion/jobs?id=4355020
Now I will monitor for other scheduled products and possible incompletes.
Updated by okurz about 2 months ago
- Status changed from Feedback to In Progress
Updated by tinita about 2 months ago · Edited
There is some problem with the hourly timer.
It's retrying every 30-something seconds without an error message, so we assume it's having problems acquiring minion guards.
Here is the minion job:
https://openqa.opensuse.org/minion/jobs?id=4355236
I deleted it now and copied the YAML here:
Minion job
Updated by tinita about 2 months ago
https://github.com/os-autoinst/openQA/pull/5951 for making the systemd script work.
next step: also enqueue git_clone for job restarts
Updated by openqa_review about 2 months ago
- Due date set to 2024-10-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by tinita about 2 months ago
Updated https://github.com/os-autoinst/openQA/pull/5951
I had a closer look at OpenQA::Task::Needle::Save now and will change it so that it creates a limit_needle_task and a git_clone_${needledir}_task guard, so that the git_clone task only needs the git_clone guard for every path, but not the limit_needle_task guard.
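The guard split described above can be sketched roughly as follows. This is a Python illustration, not openQA's Perl code: `GuardRegistry`, `acquire_all` and the helper functions are hypothetical stand-ins for Minion guards, and only the guard names `limit_needle_task` and `git_clone_${needledir}_task` are taken from this comment.

```python
import time


class GuardRegistry:
    """In-memory stand-in for Minion guards (hypothetical, for illustration only)."""

    def __init__(self):
        self._held = {}  # guard name -> expiry timestamp

    def acquire(self, name, ttl, now=None):
        """Return True if the guard was free (or expired) and is now held."""
        now = time.monotonic() if now is None else now
        expiry = self._held.get(name)
        if expiry is not None and expiry > now:
            return False
        self._held[name] = now + ttl
        return True

    def release(self, name):
        self._held.pop(name, None)


def git_clone_guard_name(path):
    # Follows the naming pattern git_clone_${needledir}_task from the comment above.
    return f"git_clone_{path}_task"


def guards_for_save_needle(needledir):
    # save_needle takes the global needle limit plus the per-path git guard.
    return ["limit_needle_task", git_clone_guard_name(needledir)]


def guards_for_git_clone(paths):
    # git_clone needs one per-path guard for every path, but not limit_needle_task.
    return [git_clone_guard_name(p) for p in paths]


def acquire_all(registry, names, ttl=60):
    """Acquire all guards or none; undo a partial acquisition on failure."""
    taken = []
    for name in names:
        if not registry.acquire(name, ttl):
            for held in taken:
                registry.release(held)
            return False
        taken.append(name)
    return True
```

With this split, a git_clone for unrelated paths is not blocked by a needle being saved elsewhere; it only conflicts when both touch the same directory.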
Updated by tinita about 2 months ago
https://github.com/os-autoinst/openQA/pull/5961 Improve minion guards for needle tasks
Updated by tinita about 2 months ago
Apparently the delete_needles task is also doing git operations, just hidden in the result class.
While trying to change the guard for that as well, I noticed that t/ui/21-admin-needles.t is failing for me locally.
ok 6 - last use is right
not ok 7 - last match is right
# Failed test 'last match is right'
# at t/ui/21-admin-needles.t line 76.
# got: 'about 12 hours ago'
# expected: 'about 14 hours ago'
I did further checks: Going to the needle admin interface on o3, osd and in my local instance I noticed that every reported time of last seen or last match is two hours older than it should be, so it's a timezone issue.
The server returns a timestamp without an offset to the client.
Currently writing a fix for this.
I need this test to pass so I can test my actual feature.
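To illustrate how an offset-less timestamp can produce a constant two-hour skew, here is a small Python sketch. It assumes the server value is really UTC and the client sits at UTC+2; both are assumptions made for the example, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_as_local(ts, local_offset_hours):
    """What a client may do with an offset-less timestamp: assume its own zone."""
    local = timezone(timedelta(hours=local_offset_hours))
    return datetime.fromisoformat(ts).replace(tzinfo=local)


def parse_as_utc(ts):
    """Intended reading if the server stores UTC but omits the offset."""
    return datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)


server_value = "2024-10-01T10:00:00"  # no offset; assumed to be UTC
skew = parse_as_utc(server_value) - parse_as_local(server_value, 2)
print(skew)  # 2:00:00
```

The client-side reading lands two hours earlier than the real event, so every "last seen" or "last match" appears two hours older. Sending the timestamp with an explicit offset removes the ambiguity.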
Updated by livdywan about 2 months ago
- Related to action #167635: Needle Admin Interface shows wrong timestamps; t/ui/21-admin-needles.t can fail locally depending on time zone size:S added
Updated by tinita about 2 months ago
While adding code for the new minion guards in OpenQA::Task::Needle::Delete I saw that we have insufficient tests in that area, so I added tests first:
https://github.com/os-autoinst/openQA/pull/5969
Updated by tinita about 2 months ago
https://github.com/os-autoinst/openQA/pull/5961 ready for review
next step (again): also enqueue git_clone for job restarts
Updated by tinita about 2 months ago
Ready: https://github.com/os-autoinst/openQA/pull/5953 Trigger git_clone also for Job restart
Updated by tinita about 2 months ago · Edited
- Status changed from In Progress to Feedback
https://github.com/os-autoinst/openQA/pull/5953 merged.
I just enabled the git_auto_update feature on o3 and disabled fetchneedles in /etc/cron.d/openqa-update-git.
Looks good so far, several git_clone jobs were triggered for classic CASEDIR jobs since then https://openqa.opensuse.org/minion/jobs?task=git_clone&state=&queue=&note=
Updated by tinita about 2 months ago · Edited
I enabled and started openqa-enqueue-git-auto-update.timer but saw another problem again:
https://openqa.opensuse.org/minion/jobs?id=4405061
---
args:
- /var/lib/openqa/share/tests/example: ~
  /var/lib/openqa/share/tests/obs: ~
  /var/lib/openqa/share/tests/openqa: ~
  /var/lib/openqa/share/tests/openqa/products/openqa/needles: ~
  /var/lib/openqa/share/tests/opensuse: ~
  /var/lib/openqa/share/tests/opensuse/products/alp/needles: ~
  /var/lib/openqa/share/tests/opensuse/products/kubic/needles: ~
  /var/lib/openqa/share/tests/opensuse/products/microos/needles: ~
  /var/lib/openqa/share/tests/opensuse/products/opensuse/needles: ~
  /var/lib/openqa/share/tests/opensuse/products/sle-micro/needles: ~
attempts: 1
children: []
created: 2024-10-09T10:36:06.786822Z
delayed: 2024-10-09T10:52:04.955126Z
expires: ~
finished: ~
id: 4405061
lax: 0
notes:
  gru_id: 20646663
parents: []
priority: 10
queue: default
result: ~
retried: 2024-10-09T10:51:26.955126Z
retries: 26
started: 2024-10-09T10:51:26.906912Z
state: inactive
task: git_clone
time: 2024-10-09T10:52:03.132277Z
worker: 2644
Could not get guard for git_clone_/var/lib/openqa/share/tests/opensuse/products/opensuse/needles_task, retrying in 31s
The needle directories also need to be skipped if they are symlinks. I only checked the product dirs.
https://github.com/os-autoinst/openQA/pull/5991 Skip all symlinks for git_auto_update service
Disabling the timer again for now, but the feature can stay active.
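A rough Python sketch of the directory collection with symlinks skipped at both levels. The function name is hypothetical and the layout `<tests_root>/<distri>/products/<product>/needles` is inferred from the paths in the minion job above; the real service is part of openQA and written in Perl.

```python
import os


def collect_update_dirs(tests_root):
    """Return distribution dirs and their needle dirs, skipping symlinks."""
    dirs = []
    for distri in sorted(os.listdir(tests_root)):
        distri_path = os.path.join(tests_root, distri)
        # Skip symlinked distribution dirs (this part was already checked)
        if os.path.islink(distri_path) or not os.path.isdir(distri_path):
            continue
        dirs.append(distri_path)
        products = os.path.join(distri_path, "products")
        if not os.path.isdir(products):
            continue
        for product in sorted(os.listdir(products)):
            needles = os.path.join(products, product, "needles")
            # Skip symlinked needle dirs as well, not only distribution dirs
            if os.path.islink(needles) or not os.path.isdir(needles):
                continue
            dirs.append(needles)
    return dirs
```

`os.path.islink` is checked before `os.path.isdir` because `isdir` follows symlinks and would report a linked directory as a regular one.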
Updated by tinita about 2 months ago
Merged and deployed on o3.
systemctl enable openqa-enqueue-git-auto-update.timer
systemctl start openqa-enqueue-git-auto-update.timer
It started immediately and finished very quickly; according to the log file everything is OK:
https://openqa.opensuse.org/minion/jobs?id=4405459
Updated by tinita about 2 months ago
One observation:
There can be cases when a lot of jobs are created. One case is openqa-investigate.
I can see many git_clone jobs created at the same time and see many investigate jobs in the scheduled table.
In that case the git_clone jobs are waiting for each other and often doing the same thing (although sometimes with different refs of course).
(This was also the case before this ticket as in that case we have an explicit CASEDIR.)
I think it might be better suited for the scheduler instead of the individual job creation/restart/schedule product events.
The scheduler regularly looks for new jobs. There could be an additional step where it collects all CASEDIR/NEEDLES_DIR settings from new jobs and creates one minion job per CASEDIR/NEEDLES_DIR.
But maybe this would require a new status.
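The collection step could look roughly like this. It is a Python sketch with a hypothetical job structure (a dict with `id` and `settings`), not the openQA scheduler's actual data model; `NEEDLES_DIR` is the spelling used earlier in this ticket.

```python
from collections import defaultdict


def group_jobs_by_repo(new_jobs):
    """Map each distinct CASEDIR/NEEDLES_DIR value to the job ids that need it."""
    wanted = defaultdict(list)
    for job in new_jobs:
        for key in ("CASEDIR", "NEEDLES_DIR"):
            url = job["settings"].get(key)
            if url:
                wanted[url].append(job["id"])
    return dict(wanted)


# One minion job would then be enqueued per key of the returned mapping,
# instead of one per openQA job.
```

Jobs without either setting drop out of the mapping, and different refs of the same repository stay separate because the full setting value is used as the key.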
Updated by tinita about 2 months ago
- Related to action #168013: Only make one api call in openqa-advanced-retrigger-jobs added
Updated by tinita about 1 month ago
Not sure what to do now. We should have a discussion whether it makes sense to move the code to the scheduler.
Updated by tinita about 1 month ago · Edited
I disabled the feature again because people were reporting that they are getting "Another git task is ongoing. Try again later." when trying to save needles.
I checked the database:
select id, created, result from minion_jobs where result::text like '%Another git task is ongoing%' and created >= '2024-10-09' order by id limit 1000;
   id    |            created            |                           result
---------+-------------------------------+------------------------------------------------------------
 4407213 | 2024-10-09 17:51:44.048423+00 | {"error": "Another git task is ongoing. Try again later."}
...
 4412348 | 2024-10-10 08:24:35.936075+00 | {"error": "Another git task is ongoing. Try again later."}
 4412895 | 2024-10-10 09:06:21.612128+00 | {"error": "Another git task is ongoing. Try again later."}
(28 rows)
It happened 28 times since yesterday morning.
I will check later how often we still see this. I turned off the feature at 08:57 UTC, so there was at least one occurrence after that: https://openqa.opensuse.org/minion/jobs?id=4412895
Updated by tinita about 1 month ago
We discussed today that it might work to check for existing GruTasks with the same args and assign the openqa jobs to those tasks.
I created a proof of concept:
https://github.com/os-autoinst/openQA/pull/6001
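The idea of reusing a pending task with identical args can be sketched like this. It is a hypothetical in-memory model in Python; the class and method names are invented for illustration and do not reflect the implementation in the linked pull request.

```python
import json


class GruTaskPool:
    """Hypothetical in-memory model of pending Gru tasks keyed by their args."""

    def __init__(self):
        self._next_id = 1
        self._pending = {}   # canonical args -> task id
        self.job_links = {}  # task id -> set of openQA job ids waiting on it

    @staticmethod
    def _key(args):
        # Canonical JSON so that equal args always map to the same key
        return json.dumps(args, sort_keys=True)

    def enqueue_or_reuse(self, args, job_id):
        """Attach job_id to a pending task with the same args, or create one."""
        key = self._key(args)
        task_id = self._pending.get(key)
        if task_id is None:
            task_id = self._next_id
            self._next_id += 1
            self._pending[key] = task_id
            self.job_links[task_id] = set()
        self.job_links[task_id].add(job_id)
        return task_id

    def finish(self, task_id):
        """Drop a finished task so later requests enqueue a fresh one."""
        self._pending = {k: v for k, v in self._pending.items() if v != task_id}
        return self.job_links.pop(task_id, set())
```

When many jobs with the same repositories are created at once, they all end up waiting on a single task instead of each enqueuing its own.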
Updated by tinita about 1 month ago
- Due date changed from 2024-10-11 to 2024-10-18
Unexpected challenges and necessary changes to existing code
Updated by tinita about 1 month ago
Draft: https://github.com/os-autoinst/openQA/pull/6001 Reuse existing GruTask
Updated by okurz about 1 month ago
Updated by tinita about 1 month ago
I enabled the feature on o3 again and disabled the cronjob. Monitoring.
Updated by tinita about 1 month ago
https://github.com/os-autoinst/openQA/pull/6015 Fix handling of job array in enqueue_git_clones
Updated by tinita about 1 month ago
https://github.com/os-autoinst/openQA/pull/6015 merged, enabled feature again on o3
Updated by okurz about 1 month ago
- Subject changed from Replace fetchneedles with a minion job size:M to Replace fetchneedles with a minion job for the regular update of git repos size:M
Updated by okurz about 1 month ago
- Copied to action #168376: Enable automatic openQA git clone instead of fetchneedles on OSD size:S added
Updated by tinita about 1 month ago
- Copied to action #168400: Improve locking scope of git_clone tasks size:S added
Updated by tinita about 1 month ago
- Status changed from In Progress to Resolved
Followup tickets created, resolving
Updated by tinita 27 days ago
- Related to action #124487: [openqa_logwarn] Can't call method "BUILD" on an undefined value at /usr/share/openqa/script/../lib/OpenQA/WebAPI/Plugin/AMQP.pm added