coordination #65271
Updated by okurz almost 3 years ago
* Offer a way for jobs to dynamically schedule children #12876
* Offer publishing a snapshotted qcow from developer mode: "Lea has trouble reproducing a bug reported from openQA and Tim would like to help by providing a snapshot of a VM reproducing the bug. For this Tim goes into developer mode, pauses at the module, triggers the bug either by stepping through the test or by doing it through VNC to then publish a snapshotted VM on openQA." #41045
* Save and publish snapshots of failing modules before rollback (on demand), see #104805, also see #90347 #104805
* In order to aid debugging when watching the video recording of a test it would be nice to display other information at the same time, such as the line of test code being executed, key presses and mouse movements sent to the SuT, serial output from the SuT, and miscellaneous measurements depending on the backend such as CPU load, power consumption etc. For that to work the information could probably be embedded into the video as additional streams and then displayed in the browser using some JavaScript #34345
* Supply a data folder to qemu VMs using https://qemu.weilnetz.de/doc/qemu-doc.html#disk_005fimages_005ffat_005fimages instead of extracting a data folder for all tests
* Document the data structures of files written by os-autoinst. As part of our retrospective we covered how the failures discovered in #57419 were handled, and the data structures should be better documented #64117
* Validate input in the WebUI (testsuite names etc.). Currently it is even possible to add a testsuite ' '. The rules should match those of the job template editor #62861
* The worker complains with an error about an "empty" workers.ini but works just fine regardless with the default worker class #59294#note-2
* Proposal: link to staging dashboards and maintenance incidents with links on the settings page. With #15384 resolved we can follow up on the original idea to link to staging dashboards or maintenance incidents. What about putting something like STAGING_DASHBOARD=https://build.suse.de/project/staging_projects/SUSE:SLE-%VERSION%:GA/D on https://openqa.suse.de/admin/products for "Server-DVD-D-Staging" and accordingly for other projects? We could do it with ObsRsync as well I guess. #58025
* Make sure the os-autoinst multi-machine docs are referenced (and not duplicated) in open.qa/docs. http://open.qa/docs/#_multi_machine_tests_setup has some information. https://github.com/os-autoinst/os-autoinst/blob/master/doc/networking.md has more but does not seem to be referenced on the former. Also https://github.com/os-autoinst/os-autoinst/blob/master/doc/openvswitch-init-example is nice but also does not seem to be referenced anywhere #54782
* Allow running tests of openQA itself in parallel. There is already the file testrules.yml in the openQA repository which lists tests that are supposed to be able to run in parallel. It would be nice if there was a Makefile target which allows running tests in parallel according to that list. Note that the -j flag has no effect on the current test target. #53954 (a hedged sketch follows after this list)
* Add a description to every openQA module. In https://github.com/os-autoinst/openQA/pull/1922 I've proposed that we add at least some very basic documentation to every Perl module in the openQA repo (Synopsis and Description). And I think we've reached agreement that this is a reasonable first step towards making it easier for new programmers to get started with working on the openQA code. #45293
* [tools][s390x] Extend 'expect_3270' functionality to provide similar info output as 'wait_serial' does. consoles/s3270.pm contains the 'wait_serial' implementation for s390 z/VM which works quite well, but compared to wait_serial it does not provide nicely visualized output like here: https://openqa.suse.de/tests/1157233#step/reconnect_s390/2 #25188
* Improve documentation and usage examples of the AMQP plugin. Since #15086 we have an AMQP plugin and we already use it for internal IRC notifications. Our documentation does not mention this anywhere and there are also no examples available or referenced in the openQA repo itself. **AC1:** The use of the AMQP plugin should be covered by at least one of documentation, example, reference implementation #27841 (a hedged example consumer follows after this list)
* Hyper-V relies on NFS to get images. On Hyper-V, images are copied from an ad-hoc mounted NFS server to a local cache on disk D. See bootloader_hyperv. #32299
* Add an automated check for isotovideo::INTERFACE. To avoid situations where changes to the testapi or the [locking api](https://github.com/os-autoinst/os-autoinst/pull/964) don't bump the interface version, a check should be added. A very rudimentary idea would be to look for changes in those two modules (e.g. look at the subs and check if there is one more in comparison to the previous version) and simply fail the check if there is no change to the interface version either. #36250
* Display the command to spawn a VM for virtualization backends. Certain backends use a command to spawn a VM that a test developer or a third party would be interested in using to spawn/respawn the VM, regardless of the backend (qemu for now would be the main target). Having the webUI display the command used to spawn the VM would be helpful. #36601
* [backend] Make HDD and PUBLISH_HDD variables platform-independent (qcow suffix). HDD & PUBLISH_HDD_1 variables in test suite definitions are usually set like this: HDD_1=sle-12-SP3-Server-DVD-%ARCH%-allpatterns.qcow2 That is, they expect the image to have a qcow2 suffix. That is incompatible with platforms which export other formats (like VHDX for Hyper-V). In some cases this can be worked around at runtime with something like s/qcow2/vhdx/ in bootloader_hyper.pm, but the original HDD file with .qcow2 still has to exist, otherwise the job will be cancelled by openQA. The other workaround is to "deep copy" the test suite to ${testsuite}_hyperv (e.g. om_proxyscc_sles12sp3_allpatterns_full_update_by_yast_hyperv on OSD) and go with this one for Hyper-V. #39209 (a hedged sketch follows after this list)
* Add verification of keyboard events sent through VNC. To improve the missing-keys behavior, the following patch has been added to qemu (basically it works only in debug mode): https://bugzilla.suse.com/show_bug.cgi?id=1031692 , http://lists.nongnu.org/archive/html/qemu-devel/2017-03/msg06018.html #18620
* Implement favorite job groups. openQA meanwhile has lots of job groups. It would be nice if I could select the ones I am interested in and the order in which I want them displayed on the main page. That would only work for logged-in users of course. #12042
* Add a linuxvnc based console. linuxvnc allows exporting a tty device over VNC. With this openQA can handle backends without actual graphics cards present, such as IPMI machines and exotic architectures, see https://github.com/LibVNC/vncterm #11566
* Use Virgil to test 3D accelerated graphics in openQA. https://virgil3d.github.io/ and https://blog.martin-graesslin.com/blog/2019/11/setting-up-a-virtual-machine-with-virgil-support/ #10680
* Restrict workers from having full API access by using random, individual tokens. Workers currently use user tokens and therefore have access to the full API. Workers should have different privileges that only allow them to grab jobs, set jobs to finished etc. That makes sure test code run by a worker can neither accidentally nor intentionally mess with the API. #6096
* Suppress the prompt sign in the "expected output" of serial consoles (or virtio-console). For reviewers of serial console (virtio) based tests the current ">_" thumbnails appear for every executed command, the output as well as the prompt sign, which does not really add valuable information and should be suppressed. #45143
* Improve the code for updating test modules on the job details page while the job is running. The way thumbnail reloads are triggered needs clarification. The update algorithm might also miss test modules completely (e.g. the `isosize` module is often not updated correctly).
* When the worker cache is restarted while an openQA worker instance waits for assets, the worker instance is stuck until restarted. The worker should not get stuck, e.g. it should abort with a timeout or be informed that the cache service terminated
* openQA became very good at restarting all tests correctly when a worker machine goes down. How about we try to suspend tests and VMs on the workers when shutting down and try to resume after reboot? If the resume fails then just start over, i.e. in the currently safe way of "abandoning" the job which should be retriggered automatically. Similar to what Jenkins does on shutdown: not accept new jobs but still finish old ones.
* The custom install documentation mentions the web proxy script but does not pull in apache2 as a dependency nor report a helpful message
* Nightly CI job that additionally upgrades dependencies from CPAN and runs the openQA tests #65975
* Ensure jobs are regarded as passed if all modules passed but the time limit is exceeded #65657
* Sometimes tests are unstable and fail and it is not easy to fix, or test maintainers just do not know how to do that. Add a RETRY option for individual test modules as well as complete scenarios
* Save space in openQA by keeping the "dot" of job results in the database for longer but deleting everything on the filesystem (maybe even move database content to a slower, different database); quotas for results and logs to show that important builds take up more space #66922#note-1
* Add support for SSH from the host to the SUT VM, e.g. using qemu parameters like `-net user,hostfwd=tcp::7777-:8001`, and also start an ssh server within the SUT and unblock it in the firewall #31417 (a hedged sketch follows after this list)
* Calculate/check checksums of generated/used assets, especially fixed assets; at least show them in vars.json so that "investigation" shows any difference #67288
* Generic bugref in comments. Find a generic way to define bugrefs without having the individual issue tracker URL specified within the openQA source code, e.g. with `bug:<url>` #38432
* Include failed tests with closed issues assigned in another level of the "TODO test overview" as these tests are relevant for reviewers. Also please keep in mind that in many cases a label linking to a "closed" issue is actually still valid, e.g. a "RESOLVED FIXED" bug where the bugfix has not reached the product yet. #48998
* Filter and group by product issue / test issue, e.g. in the build overview bar, to distinguish "failed due to product issue" from "failed due to test issue"
* Add support for adding manual users that are not managed over an external identity provider, e.g. dedicated "users" for workers over API, optionally API #6558
* openqa-scheduler should not stay "running" but inoperative on `Mojo::Reactor::Poll: Timer failed: Can't open database lock file /var/lib/openqa/db/db.lock! at /usr/share/openqa/script/../lib/OpenQA/Schema.pm line 87.` -> #69523
* Improve displaying serial results (further ideas mentioned in #70648)
  * If the test (soft)failed, collapse green test modules (except for a linked #anchor), those aren't the main point of interest anyway
  * Treat successive terminal output/input more like an actual terminal and use an inline line break instead of separate boxes
  * If the output matched, only show the output (with leading and trailing newlines removed) and not the "# wait_serial expected..." and "result:" lines
  * If the space is too narrow for the "Report bug" buttons, hide them until the relevant step is clicked or hovered
  * Allow expanding all serial results (within a test module) at once
  * I'd keep the boxes as quick links into the single long terminal view. When you click a box, it'll open the terminal view and jump to the location of the corresponding output.
  * I'd much prefer to get rid of the "# wait_serial expected: blahblahblah" as much as possible. (See #70648#note-7 for details.)
* Improve asset upload performance from the worker to the webUI, so optimisations like the [local upload feature](https://github.com/os-autoinst/openQA/pull/3325) are not necessary anymore
* (Semi-)automatic cleanup of old needles: In very active needle repos outdated needles should be deleted from time to time when they have not matched or been used for long. We have a UI for that but we could improve it with automation. Also see https://chat.suse.de/channel/testing?msg=rYLtgxCr4a7GeKTsh and following messages
* Generate I/O salt monitoring and alert panels dynamically, see #70834
* Cache service:
  * Prevent purging of assets before the worker has the chance to hardlink the asset. (See #71827 and #71827#note-22 in particular.)
* As discussed in the SUSE QE Tools weekly 2020-11-20 we encountered the problem that some team members were never subscribed to certain projects, mailing lists, etc. We should look for a single source of truth for the team roster and then at best automatically derive everything else from that.
* Add a configuration option to give "operator" or "admin" permissions to users by default, e.g. for trusted instances or public instances that are meant to be used for personal use cases
* Improve the log analysis experience:
  * Colorize the logs in the "Logs & Assets" tab (a hedged command-line sketch follows after this list)
  * Add a filter option to show only the lines that match one criterion (e.g. criticality)
  * Add a time range selector to show only the lines in between
  * Add a filter to search text by field (component, message, ...)
  * NOTE: Take [lnav](https://lnav.org/) into consideration
  * Suggestion from ybonatakis: [Kafka+ELK](https://logz.io/blog/deploying-kafka-with-elk/)
* Monitoring check based on a passive performance measurement regarding throughput on network interfaces (#78127), e.g. Grafana alerts
* Try to post auto-review comments as "auto-review", not "geekotest", with corresponding user account keys etc., e.g. check what user "geekotest" is doing, either just use a new API key and secret or sudo? (#77899)
* Show results from openQA investigation jobs directly on /tests/overview, based on a discussion between okurz and dimstar in https://matrix.to/#/!ZzdoIxkFyLxnxddLqM:matrix.org/$1612021254698548YsMQD:matrix.org?via=matrix.org: "my usual workflow in the morning is the todo view, looking at the build fails/modules. maybe if the investigation job info would be visible there directly instead of being behind a few clicks. note that i do this initial task usually on the phone screen before moving to the office. i would rather think of 'status result symbols'. interesting would be if the rerun failed the same way or different or success. in case of same fail, i could likely skip the manual rerun, different or success would still make my rerun interesting"
* Warn if the sum of the configured quotas on https://openqa.opensuse.org/admin/assets (excluding "Untracked") exceeds the available space on the filesystem
* Execute tests in containers instead of VMs
* Provide [proof of readiness](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) with our container definitions
* Provide an API to set up, spawn and inspect containers (currently this has to be done via the CLI)
* Similar to job details, reflect the tests overview summary within the favicon, e.g. the same color as the summary box, red if anything failed, and so on
* Provide links to JSON representations of pages directly on the page. Similar to what Redmine does in the footer of pages: "Also available in: Atom/PDF/JSON/XML"
* Consider using https://prettier.io/ for files like .json, .css, .yaml, .html
* openQA scheduling with clone job or "jobs post" uses _GROUP=0 to schedule a job outside any group; with "isos post" the parameter is used to limit the matching to a group
* In https://progress.opensuse.org/issues/67006 the problem was that there are too many needles so that the check takes very long. We could add an option to `check_or_assert_screen` to make that fatal for time-critical checks, e.g. grub2 with its timeout (a hedged sketch follows after this list)
* Email notifications for done jobs -> specify a list of notification recipients in tests and send a notification
* Run Kubernetes on the internal cloud and runners on there, especially non-qemu ones, to free workers to run qemu and move to o3
* Feedback from the SUSE QE Tools workshop 2021-03-19: asmorodskyi: limitation in multi-machine tests: TAP devices need to be handled outside the tests and be in line with the tests. So the setup is difficult and error-prone. It would be nice to have API calls to create/manage TAP devices or to automatically create TAP devices based on test variables that specify the requirements
* Create an openQA and/or os-autoinst tutorial using https://cloud.google.com/shell/docs/cloud-shell-tutorials/tutorials
* Similar to `__ONLY_OBSOLETE_SAME_BUILD` we could have `__ONLY_OBSOLETE_OTHER_BUILDS`, see #90131
* Fix the build cycle in https://github.com/os-autoinst/openQA/blob/master/dist/rpm/openQA.spec: openQA requires %test_requires, which requires %worker_requires, which requires openQA-client, which is built by openQA (see https://matrix.to/#/!ZzdoIxkFyLxnxddLqM:matrix.org/$1617012235540396JPrIK:matrix.org?via=matrix.org&via=opensuse.org )
* [API call that allows searching for test suites](https://progress.opensuse.org/issues/38807#note-48) by one or more settings, and possibly filtering either with Perl expressions/operators or something that allows such behavior, see #38807
* URLs on /admin/productlog should be clickable as well, e.g. https://openqa.suse.de/admin/productlog?id=695518 shows a URL. On other pages the URLs are already parsed to be clickable
* Skip investigation (within the `openqa-investigate` script) when the developer mode has been opened (as the failure is likely a unique result of messing with the test manually)
* Show the number of skipped test modules per job in /tests/overview as well, see https://chat.suse.de/channel/testing?msg=zxABmL5wiSFuDpM3H
* By default the needle editor makes it too easy for users to "just create a new needle" when in many cases more elaborate choices are a better alternative. For example, a slight font rendering difference is solved by most(?) needle creators by creating an additional needle with the default match level instead of replacing existing needles with a slightly lower match level to cover the slight rendering differences. Motivated by the discussion in https://chat.suse.de/channel/testing?msg=jYW7YwKn32c8WA9v6
* It would be nice to complete the job started with #91527 to get a completely human-readable and parser-readable autoinst-log where all new lines start with the same "[date] [level] ..." format or with two spaces for multi-line entries. See the detected cases at https://progress.opensuse.org/issues/91527#note-9
* Unassign jobs back to the openQA scheduler if the specific backend reports that test requirements cannot be met, e.g. if not enough memory is available, see https://progress.opensuse.org/issues/93112#note-1
* "isos post" request processing should be stricter and decline requests when not all dependencies are met (e.g. a missing `START_AFTER_TEST` job)
* Offer an API route to update job settings, e.g. to patch "WORKER_CLASS" if a worker class is replaced by another machine
* In the openQA needle editor make the possibility and suitability of a lower-than-default match ratio visible. Also, the ratio applies to each area in the needle and the UI doesn't highlight whether a needle was created with a lower or higher match ratio. So consider highlighting when a needle is created with a match ratio different from the default? #94231
* When trying to update existing openQA medium types which are used in job templates an error message is shown: `Bad Request: Group Test must be updated through the YAML template`. Instead the error message should better explain that if you want to change anything but settings while that medium is referenced in job templates, you would invalidate the job templates, which need that product string to exactly match the parameters. You would need to at least temporarily prevent the product from being used anywhere. Likely the easiest way is to define a *new* medium, then update the job templates to use the new product, then delete the old medium
* In multi-machine test scenarios, when `barrier_wait` is called before `barrier_init` the corresponding jobs fail. This can happen e.g. if the job intended to call `barrier_init` is slower to start up due to asset syncing, setup, etc. Instead of failing, jobs could wait in `barrier_wait` until the corresponding barrier is initialized (suggested by acarvajal+okurz in the SUSE QE Tools workshop 2021-07-16; a hedged sketch follows after this list)
* Job details already point to what triggered the jobs with a link to the audit log, e.g. `Scheduled product: [opensuse-Tumbleweed-DVD-x86_64-20210714](https://openqa.opensuse.org/admin/productlog?id=220381)`. For retriggered jobs it says `Scheduled product: job has not been created by posting an ISO (but possibly the [original job](link))`. For investigation it would be nice to see what account triggered a job (suggested by acarvajal+foursixnine+okurz in the SUSE QE Tools workshop 2021-07-16)
* The clone_job logic needs to verify HDD_* / ISO_* to cover the case when an asset was deleted by garbage collection. Important point: the verification should check not the actual test assets vs. the variables in settings, but the actual assets **on the target openQA server** vs. the variable settings, which allows cloning the job anyway after manually downloading the missing asset
* Configuration applicable to all jobs of one openQA instance -> #95299#note-17
* Optimize or limit grep expressions in os-autoinst/scripts to prevent overload when long log files are grepped, e.g. only allow a subset of regex groups, limit the size that we read from logfiles, etc. -> #96713
* The list of scheduled products can grow quite large. On OSD we still have products from 6 years ago and the huge table has very long loading times. It would make sense to have a cleanup for this table, similar to how we clean up audit events.
* Jobs can take very long uploading and downloading assets if there are many concurrent jobs using assets, e.g. https://openqa.suse.de/tests/7126196 ran from 13:40 to 14:14 (28%) and uploaded from 14:14 to 15:43 (72%). Here the limitation of openQA and of test designs based on relatively big qcow files (4.7 GB) shows: tests can run in parallel on many workers, but the assets all need to be uploaded to the central storage and downloaded again from there, which does not scale that well. mdoucha suggests the following three features:
  * Automatically put uploaded assets into the local cache service
  * Make the scheduler aware of worker cache contents and prioritize jobs which will use existing cached assets
  * Allow worker slots to start a new job while the previous one is still uploading results
* https://gitlab.suse.de/qa-maintenance/bot-ng/-/blob/c0a66f2f190b2ac4a2d5c7097a0a1656a9873b61/openqabot/approver.py#L100 should provide more details in the message about which results the approval was based on (https://suse.slack.com/archives/C02CANHLANP/p1631783883288200 and following messages). Specific suggestion by coolo: "I would welcome if the dashboard-openqa review comment was a little more detailed. Not a comment that triggers a mail notification, but just like the human reviewers leave a log file with details. 'approved' is a bit hard to challenge a week after"
* Feedback from the SUSE QE Tools workshop 2021-09-17:
  * Add an explanation or a link to an explanation below automatic investigation job comments, e.g. point to https://github.com/os-autoinst/scripts/#openqa-investigate---automatic-investigation-jobs-with-failure-analysis-in-openqa
  * Show who changed priority values, e.g. manually on the jobs (apappas)
  * Allow drag-and-drop changes of priority, e.g. moving a scheduled job to the top (okurz)
  * Proposal by asmorodskyi, not sure if it's good: Only have pre-defined priority steps like "now", "real-time", "soon", "later" :)
* Originally okurz thought that with https://github.com/os-autoinst/openQA/pull/1135 we would show "$version-$build" on job group overview pages when multiple builds for different versions show up or even different versions share the same value for build. We should show the version if the current dataset is ambiguous in this regard.
* Ensure that assets are correctly generated/uploaded/downloaded with checksums to prevent problems of "wrong format", see #98727
* Try to rephrase the "Blocking SIGCHLD and SIGTERM" debug messages to make them less scary. Also they appear twice
* On the job group overview, record in the popup how many jobs have been restarted, if any, and also whether some jobs have forced results
* Use Syncthing (or another synchronisation solution or a distributed filesystem) to sync tests and shared assets to workers, to not overload a central instance so much?
* Combine builds for multiple versions depending on configuration. Idea from jlausuch in #53264#note-13: it would be nice if the same build numbers were grouped together instead of having different entries for the same build, as you can see in the image https://progress.opensuse.org/attachments/12412/wicked_view.png
* Feedback from the SUSE QE Tools workshop 2022-01-28:
  * vpelcak proposed that force_result should not be applicable, or bot-ng should still not approve, if there are any skipped modules. Idea from mdoucha: show module results like on /tests also on /tests/overview so that reviewers can see skipped modules. I wouldn't block on skipped modules but just stay with "blocked if jobs fail"; then the logic within the tests can still decide what effect skipped modules should have. What I took from the workshop is that we can try to make the presence of skipped modules more prominent, for example by showing an icon like the "ban" icon we already use on /tests next to the failed modules but only if there are any failed modules
  * From fniederwanger: there could be warnings that are shown on the level of the test overview. okurz suggests displaying them as a flash icon for test issues like in other cases. We could try to extend GitHub references to git permalinks; maybe it already works like this
  * Question from vpelcak and okurz: do reviewers look into the details of jobs with skipped modules before approving the overall release?
* Custom GitHub actions for triggering openQA jobs based on an "@openqa" mention
* To ease auto-approval in SUSE maintenance tests, an "openqa-fail-softener": search all failed tests that have a bugref, update the result to softfailed if it is a test issue or a bug with low severity && !blocker
* Add a section to the documentation on how to better investigate failed jobs, e.g. including FORCE_PUBLISH_HDD, see #104805
* From #103317#note-5 the idea was for jobs to show how many jobs are in the backlog, both in total and for the specific worker class settings. This would allow users to better know what to expect and to identify misconfigurations, worker class bottlenecks, etc. I envision a message like `scheduled (since about 2 hours, 1234 jobs in the queue for the same worker class)`. Also we could estimate how long jobs in the same worker class took in the past
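
For the "run openQA's own tests in parallel according to testrules.yml" item above, a minimal sketch of a helper that a `test-parallel` Makefile target could call. The structure of testrules.yml assumed here (a top-level `parallel` key listing parallel-safe test files) is an assumption, not the actual schema.

```perl
#!/usr/bin/env perl
# Hypothetical helper for a "make test-parallel" target: read the list of tests
# that are declared safe to run concurrently and hand them to "prove -j".
# The key name "parallel" and the file layout are assumptions.
use strict;
use warnings;
use YAML::PP;

my $rules    = YAML::PP->new->load_file('testrules.yml');
my @parallel = @{$rules->{parallel} // []};
die "no parallelizable tests listed in testrules.yml\n" unless @parallel;

# run the parallel-safe tests with 4 jobs; everything else would still go
# through the regular serial "test" target
system('prove', '-l', '-j4', @parallel) == 0 or die "prove failed: $?\n";
```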
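For the AMQP plugin documentation item, a self-contained consumer of the kind that could be referenced from the docs. The broker address, guest credentials, exchange and routing key follow the publicly documented rabbit.opensuse.org setup; treat them as assumptions for any other deployment, and adjust the TLS options as needed.

```perl
#!/usr/bin/env perl
# Example AMQP consumer: print openQA "job done" events from the public
# openSUSE message bus. Broker, credentials, exchange and routing key follow
# the rabbit.opensuse.org setup; adjust (including ssl_cacert) for your instance.
use strict;
use warnings;
use Net::AMQP::RabbitMQ;

my $mq = Net::AMQP::RabbitMQ->new;
$mq->connect('rabbit.opensuse.org',
    {user => 'opensuse', password => 'opensuse', port => 5671, ssl => 1});
$mq->channel_open(1);

# exclusive, auto-named queue bound to the openQA "job done" events
my $queue = $mq->queue_declare(1, '', {exclusive => 1});
$mq->queue_bind(1, $queue, 'pubsub', 'opensuse.openqa.job.done');
$mq->consume(1, $queue, {no_ack => 1});

while (my $msg = $mq->recv()) {
    print "$msg->{routing_key}: $msg->{body}\n";    # body is a JSON document
}
```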
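For the platform-independent HDD_*/PUBLISH_HDD_* item, a sketch of the kind of mapping a backend could apply so test suites can keep a single, neutral asset name. Neither the `%format_for` table nor the helper exist in os-autoinst today; they are purely illustrative.

```perl
# Illustrative only: map the conventional .qcow2 suffix of HDD_*/PUBLISH_HDD_*
# settings to the disk image format a given backend actually consumes/produces.
use strict;
use warnings;

my %format_for = (
    qemu   => 'qcow2',
    hyperv => 'vhdx',
    svirt  => 'vmdk',
);

sub platform_asset_name {
    my ($backend, $name) = @_;
    my $suffix = $format_for{$backend} // 'qcow2';
    $name =~ s/\.qcow2$/.$suffix/;
    return $name;
}

# platform_asset_name('hyperv', 'sle-12-SP3-Server-DVD-x86_64-allpatterns.qcow2')
#   returns 'sle-12-SP3-Server-DVD-x86_64-allpatterns.vhdx'
```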
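For the host-to-SUT SSH item, a sketch of a test module that assumes user-mode networking with a host port forward like the one quoted above; passing such a forward via `NICTYPE=user` plus `NICTYPE_USER_OPTIONS` (here forwarding host port 7777 to guest port 22 for ssh) is an assumption to be verified against the qemu backend documentation.

```perl
# Sketch only: assumes the job was started with user-mode networking and a
# host-to-SUT port forward, e.g. settings such as NICTYPE=user and
# NICTYPE_USER_OPTIONS=hostfwd=tcp::7777-:22 (variable names to be verified).
# "root-console" is the console name used by the openSUSE test distribution.
use base 'basetest';
use strict;
use warnings;
use testapi;

sub run {
    select_console 'root-console';
    # start an SSH server inside the SUT and open the firewall for it
    assert_script_run 'systemctl enable --now sshd';
    assert_script_run 'firewall-cmd --permanent --add-service=ssh && firewall-cmd --reload';
    # with the forward in place, "ssh -p 7777 root@localhost" on the worker
    # host should now reach the SUT
}

1;
```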
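For the "colorize the logs" sub-item of the log analysis ideas, a toy command-line sketch of the colouring rule by log level; a real implementation would live client-side in the "Logs & Assets" tab, and the line format assumed here is the `[date] [level] ...` format that poo#91527 aims for.

```perl
#!/usr/bin/env perl
# Toy version of the "colorize the logs" idea: read autoinst-log.txt on stdin
# and colour lines by log level.
use strict;
use warnings;
use Term::ANSIColor qw(colored);

my %color_for = (debug => 'cyan', info => 'green', warn => 'yellow', error => 'red');

while (my $line = <STDIN>) {
    # expect the "[date] [level] message" format that poo#91527 aims for
    if ($line =~ /^\[[^\]]+\]\s+\[(\w+)\]/ && $color_for{lc $1}) {
        print colored($line, $color_for{lc $1});
    }
    else {
        print $line;    # continuation lines and unknown formats stay uncoloured
    }
}
```

Usage: `perl colorize-log.pl < autoinst-log.txt | less -R`.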
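For the #67006 item about screen checks becoming too slow with many needles, a sketch of what a "fail loudly if the check itself took too long" behaviour could look like, written here as a wrapper in test code; the wrapper and the `max_check_time` parameter are hypothetical and do not exist in testapi.

```perl
# Hypothetical wrapper illustrating a "fatal if the screen check itself took
# too long" option for time-critical screens such as the grub menu.
use strict;
use warnings;
use testapi;

sub check_screen_time_critical {
    my ($mustmatch, %args) = @_;
    my $budget = $args{max_check_time} // 10;    # seconds the whole check may take
    my $start  = time;
    my $match  = check_screen($mustmatch, $args{timeout} // 5);
    my $took   = time - $start;
    die "check_screen('$mustmatch') took ${took}s (budget: ${budget}s), "
      . "needle matching is probably slowed down by too many needles\n"
      if $took > $budget;
    return $match;
}

# usage in a time-critical spot:
# check_screen_time_critical('grub2', timeout => 3, max_check_time => 5)
#     or record_info('grub', 'grub menu not seen in time');
```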
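For the barrier_wait-before-barrier_init item, a sketch of the proposed "wait until the barrier exists" behaviour expressed as a retry wrapper in test code; it assumes that `barrier_wait` dies when the barrier has not been created yet, which is the failure mode described in the item.

```perl
# Hypothetical retry wrapper: instead of failing when the parent job has not
# created the barrier yet, keep polling for a while before giving up.
use strict;
use warnings;
use lockapi 'barrier_wait';

sub barrier_wait_patiently {
    my ($name, %args) = @_;
    my $retries = $args{retries} // 30;
    my $delay   = $args{delay}   // 10;    # seconds between attempts
    for my $attempt (1 .. $retries) {
        # barrier_wait blocks until the barrier is released once it exists;
        # the eval catches the "barrier does not exist yet" failure (assumed)
        return 1 if eval { barrier_wait($name); 1 };
        warn "barrier '$name' not ready yet (attempt $attempt/$retries): $@";
        sleep $delay;
    }
    die "barrier '$name' was not initialized within " . $retries * $delay . " seconds\n";
}
```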