Improve handling of needle deprecation and deletion
While checking the latest logwarns I noticed the following behavior in openQA:
[Mon Dec 11 09:27:37 2017] [5339:warn] /var/lib/openqa/share/tests/sle/products/sle/needles/welcome-bsc1058099-20170915.json: No such file or directory
[…]
[Mon Dec 11 09:27:51 2017] [5228:warn] /var/lib/openqa/share/tests/sle/products/sle/needles/welcome-bsc1058099-20170915.json: No such file or directory
[…]
[Mon Dec 11 09:43:56 2017] [11050:warn] /var/lib/openqa/share/tests/sle/products/sle/needles/welcome-bsc1058099-20170915.json: No such file or directory
[Mon Dec 11 09:43:56 2017] [11050:error] Could not parse needle: welcome-bsc1058099-20170915 for sle 15
I guess this happens because somebody viewed a job in the webUI that had used the needle "welcome-bsc1058099-20170915" after the needle had already been deleted.
The code at https://github.com/os-autoinst/openQA/blob/8f30ca8dc7bc0b9bbb48a6507110fa06dd335aff/lib/OpenQA/WebAPI/Controller/Step.pm#L244 then tried to access the needle, hence the error message.
- AC1: Deleted needles do not trigger an error or warning when viewing test details
- AC2: Needles that were matched during test runs but have since been deleted are clearly recognizable in the test details
- Check if a needle exists on the FS before accessing it
- Properly handle nonexistent needles (e.g. do not blindly access them if they do not exist)
- Notify the user somehow that this job used a needle which has since been deleted
- E.g. show in the needle selection: "99% old_needle_name (deleted meanwhile)"
- Disable the needle-compare-function if the reference does not exist anymore
- On top of that, when a deletion is "requested" we could keep old needles around while tests are still running (which we effectively do with caching)
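
The first suggestions above could be sketched roughly as follows. This is illustrative Python, not openQA's actual Perl code; the function name `describe_matches`, the `(name, similarity)` input format, and the returned keys are all assumptions made up for this example.

```python
from pathlib import Path


def describe_matches(needle_dir, matches):
    """Build display entries for the needle selection of a test step.

    `matches` is a list of (needle_name, similarity_percent) tuples as
    recorded when the job ran. All names here are hypothetical.
    """
    entries = []
    for name, similarity in matches:
        json_file = Path(needle_dir) / f"{name}.json"
        png_file = Path(needle_dir) / f"{name}.png"
        # Check the file system first instead of opening the file blindly.
        exists = json_file.is_file() and png_file.is_file()
        label = f"{similarity}% {name}"
        if not exists:
            label += " (deleted meanwhile)"
        entries.append({
            "name": name,
            "label": label,
            # Without a reference there is nothing to compare against.
            "can_compare": exists,
        })
    return entries
```

A caller would render `label` in the needle selection and disable the compare function whenever `can_compare` is false, so a missing needle never produces a warning or error in the log.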
For now I'll add these two log messages to the logwarn ignore list since, 99% of the time, they don't indicate a critical error we have to react to.
From lnussel in #34549:
I need to retire pattern-minimalx-checked-boo1086058-20180320 and replace it with a correct one.
openQA does not provide a way to do that properly. I'd have to delete it manually from git which would break old test results. Would be better if there was a way to flag the needle so it's no longer used for new matches.
And a more recent discussion about this:
[13 Jun 2018 09:11:03] <coolo> hands up! how broke tty2-selected needles?
[13 Jun 2018 09:12:09] <okurz> coolo: could be me, link?
[13 Jun 2018 09:12:29] <coolo> okurz: https://openqa.suse.de/group_overview/108 - spot the difference
[13 Jun 2018 09:12:44] <coolo> delete mode 100644 tty2-selected-20151201.json
[13 Jun 2018 09:12:44] <coolo> delete mode 100644 tty2-selected-20151201.png
[13 Jun 2018 09:12:49] <coolo> looks related :)
[13 Jun 2018 09:14:19] <okurz> coolo: I can't find the needle tty2-selected-20180613 I created over the webui
[13 Jun 2018 09:14:24] <coolo> so you didn't even do a MR for it?
[13 Jun 2018 09:14:58] <okurz> coolo: it's on osd. It's in a commit but that one has not been pushed
[13 Jun 2018 09:15:08] <okurz> commit df6b97935a364e341f58590dea75486ae99ba704
[13 Jun 2018 09:15:13] <coolo> because you broke the automatic pushing by doing manual commit at the same time
[13 Jun 2018 09:15:24] <coolo> anyway, if it's on osd it's broken - as breaking QAM
[13 Jun 2018 09:15:51] <okurz> aren't we always commiting "at the same time"?
[13 Jun 2018 09:16:35] <coolo> okurz: not so often from workstations
[13 Jun 2018 09:16:41] <okurz> https://openqa.suse.de/tests/1759358#step/first_boot/3 is where it's used properly. so I guess I broke tests because they were running at the time of merging my delete MR and new needles are only picked up after restarting …
[13 Jun 2018 09:16:42] <c4y> openQA: caasp-2.0-CaaSP-DVD-Incidents-x86_64-Build:7740:file.1528870658-QAM-
[13 Jun 2018 09:16:57] <okurz> hm
[13 Jun 2018 09:17:45] <coolo> but I give your theory a try and restart all fails
[13 Jun 2018 09:18:31] <okurz> ok, pushed as geekotest from osd as workaround. I guess this asks for a long-term feature to improve the handling of how needles reach the system. Can I expect you to handle that? or should I create a ticket with the chat log?
[13 Jun 2018 09:19:19] <coolo> okurz: I'm not following. if you created the needle before you deleted the old one, osd should never had a time where no tty2 needle existed
[13 Jun 2018 09:19:27] <coolo> so I assume you broke it by first deleting and then creating
[13 Jun 2018 09:20:15] <coolo> even though the timestamps of the git commits disagree
[13 Jun 2018 09:20:17] <coolo> but I don't care
[13 Jun 2018 09:23:17] <coolo> http://bugzilla.suse.com/show_bug.cgi?id=1086457#c54 - one more satisfied openqa user :)
[13 Jun 2018 09:23:17] <c4y> - 1086457: ltp_aiodio_part test failures: BTRFS critical (device vda2): unable to find logical 8820195328 - https://bugzilla.suse.com/show_bug.cgi?id=1086457
[13 Jun 2018 09:31:32] <okurz> coolo: Deleting a needle has an effect immediately:
[13 Jun 2018 09:31:32] <okurz> Could not open image /var/lib/openqa/cache/tests/sle/products/sle/needles/tty2-selected-20151201.png
[13 Jun 2018 09:31:32] <okurz> [2018-06-13T07:50:55.0141 CEST] [debug] SKIP(tty2-selected-20151201:missing PNG)
[13 Jun 2018 09:31:49] <okurz> in https://openqa.suse.de/tests/1759188/file/autoinst-log.txt but it should not have with proper caching to have a consistent set
[13 Jun 2018 09:32:11] <coolo> okurz: ah, I know
[13 Jun 2018 09:32:24] <okurz> let me pack that in a ticket
[13 Jun 2018 09:32:25] <coolo> w1 synced the old state and started job
[13 Jun 2018 09:32:33] <coolo> old needle was parsed in
[13 Jun 2018 09:32:41] <coolo> w2 synced the new stated
[13 Jun 2018 09:32:54] <coolo> w1 looked for his needle and didn't find it -> but ignored it
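
One way to read "proper caching to have a consistent set" from the log above is that each job works on its own snapshot of the needle directory, so a later sync by another worker cannot remove a PNG whose JSON was already parsed. The following is only a sketch of that idea in Python, not how the openQA worker cache is implemented; `snapshot_needles`, `load_needles`, and the directory layout are hypothetical.

```python
import shutil
from pathlib import Path


def snapshot_needles(cache_dir, job_dir):
    """Copy the shared needle cache into a private per-job directory.

    Hypothetical helper: after this call, deletions in `cache_dir`
    no longer affect the needles the job sees.
    """
    target = Path(job_dir) / "needles"
    shutil.copytree(cache_dir, target)
    return target


def load_needles(needle_dir):
    """Return the names of needles whose JSON and PNG are both present."""
    needle_dir = Path(needle_dir)
    return sorted(
        p.stem for p in needle_dir.glob("*.json")
        if p.with_suffix(".png").is_file()
    )
```

With this, the sequence described by coolo (w1 parses the old needle, w2 syncs the new state, w1 then misses the PNG) would leave w1 unaffected because it reads from its own snapshot. A full copy is the simplest variant; hard links or a versioned cache directory would be cheaper but are not shown here.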