coordination #65271
Updated by okurz over 3 years ago
* Offer a way for jobs to dynamically schedule children #12876
* Offer publishing a snapshotted qcow from developer mode: "Lea has trouble reproducing a bug reported from openQA and Tim would like
to help by providing a snapshot of a VM reproducing the bug. For this Tim goes into developer mode, pauses at the module, triggers
the bug either by stepping through the test or by doing it through VNC to then publish a snapshotted VM on openQA." #41045
* To aid debugging when watching the video recording of a test, it would be nice to display other information at the same time, such as the line of test code being executed, key presses and mouse movements sent to the SUT, serial output from the SUT, and miscellaneous measurements depending on the backend such as CPU load or power consumption. For that to work the information could probably be embedded into the video using streams and then displayed in the browser using some JavaScript #34345
* Supply data folder to qemu VMs using https://qemu.weilnetz.de/doc/qemu-doc.html#disk_005fimages_005ffat_005fimages instead of extracting a data folder for all tests
* Document data structures of files written by os-autoinst. As part of our retrospective we covered how the failures discovered in #57419 were handled, and concluded that the data structures should be better documented #64117
* Validate input in WebUI (testsuite names etc.). Currently it is even possible to add a testsuite ' '. The rules should match those of the jobtemplate editor #62861
* worker complains about "empty" workers.ini with error but works just fine regardless with default worker class #59294#note-2
* proposal: link to staging dashboards and maintenance incidents with links in settings page. With #15384 resolved we can follow up on the original idea to link to staging dashboards or maintenance incidents. What about putting something like STAGING_DASHBOARD=https://build.suse.de/project/staging_projects/SUSE:SLE-%VERSION%:GA/D on https://openqa.suse.de/admin/products for "Server-DVD-D-Staging" and accordingly for other projects? We could do it with ObsRsync as well I guess. #58025
* make sure os-autoinst multi-machine docs are referenced (and not duplicated) in open.qa/docs. http://open.qa/docs/#_multi_machine_tests_setup has some information. https://github.com/os-autoinst/os-autoinst/blob/master/doc/networking.md has more but does not seem to be referenced from the former. Also https://github.com/os-autoinst/os-autoinst/blob/master/doc/openvswitch-init-example is nice but does not seem to be referenced anywhere either #54782
* Allow running tests of openQA itself in parallel. There is already the file testrules.yml in the openQA repository which lists tests which are supposed to be able to run in parallel. It would be nice if there was a Makefile target which runs tests in parallel according to that list. Note that the -j flag has no effect on the current test target. #53954
* Add a description to every openQA module. In https://github.com/os-autoinst/openQA/pull/1922 I've proposed that we add at least some very basic documentation to every Perl module in the openQA repo (Synopsis and Description). And I think we've reached agreement that this is a reasonable first step towards making it easier for new programmers to get started with working on the openQA code. #45293
* [tools][s390x] Extend 'expect_3270' functionality to provide similar info output as 'wait_serial' does. In consoles/s3270.pm is the 'wait_serial' implementation for s390-zVM which works quite well, but compared to wait_serial it's not providing well visualized output like here:
https://openqa.suse.de/tests/1157233#step/reconnect_s390/2 #25188
* Improve documentation and usage examples of AMQP plugin. Since #15086 we have an AMQP plugin and we already use it for internal IRC notifications. Our documentation does not mention this anywhere and there are also no examples available or referenced in the openQA repo itself. **AC1:** The use of the AMQP plugin should be covered by at least one of documentation, example, reference implementation #27841
* Hyper-V relies on NFS to get images. On Hyper-V, images are copied from an ad hoc mounted NFS server to a local cache on disk D. See bootloader_hyperv. #32299
* Add automated check for isotovideo::INTERFACE . To avoid situations where changes to the testapi or [locking api](https://github.com/os-autoinst/os-autoinst/pull/964) don't bump the interface version, a check should be added. A very rudimentary idea could be to look for changes in those two modules (look for the subs, check if there is one more in comparison to the previous version) and then simply fail the test if there's no change to the interface version either. #36250
* Display the command to spawn a VM for virtualization backends. Certain backends use a command to spawn a VM that a test developer or a third party would be interested in using to spawn/respawn the VM, regardless of the backend (QEMU would be the main target for now). Having the web UI display the command used to spawn the VM would be helpful. #36601
* [backend] Make HDD and PUBLISH_HDD variables platform-independent (qcow suffix). The HDD_1 and PUBLISH_HDD_1 variables in test suite definitions are usually set like this: HDD_1=sle-12-SP3-Server-DVD-%ARCH%-allpatterns.qcow2 That is, they expect the image to have a qcow2 suffix. That is incompatible with platforms which export other formats (like VHDX for Hyper-V). In some cases this can be worked around at runtime, like s/qcow2/vhdx/ in bootloader_hyperv.pm, but the original HDD file with .qcow2 still has to exist, otherwise the job will be canceled by openQA. The other workaround is to "deep copy" the test suite to ${testsuite}_hyperv (e.g. om_proxyscc_sles12sp3_allpatterns_full_update_by_yast_hyperv on OSD) and go with this one for Hyper-V. #39209
* Add verification of keyboard events sent through VNC. To improve the missing keys behavior, the following patch has been added to QEMU (basically it works only in debug mode): https://bugzilla.suse.com/show_bug.cgi?id=1031692 , http://lists.nongnu.org/archive/html/qemu-devel/2017-03/msg06018.html #18620
* implement favorite job groups. openQA meanwhile has lots of job groups. It would be nice if I could select the ones I am interested in and the order I want them to be displayed in on the main page. That would only work for logged-in users of course. #12042
* add linuxvnc based console. linuxvnc allows exporting a tty device over VNC. With this openQA can handle backends without actual graphics cards present, such as IPMI machines and exotic architectures, see https://github.com/LibVNC/vncterm #11566
* Use Virgil to test 3D accelerated graphics in openQA. https://virgil3d.github.io/ and https://blog.martin-graesslin.com/blog/2019/11/setting-up-a-virtual-machine-with-virgil-support/ #10680
* Restrict the API access of workers by using random, individual tokens. Workers currently use user tokens and therefore have access to the full API. Workers should have different privileges that only allow them to grab jobs, set jobs to finished etc. That makes sure test code run by a worker can neither accidentally nor intentionally mess with the API. #6096
* Suppress the prompt sign in the "expected output" of serial consoles (or virtio-console). For reviewers of serial console (virtio) based tests the current ">_" thumbnails appear for every executed command, showing the output as well as the prompt sign, which does not really add valuable information and should be suppressed. #45143
* Improve code for updating test modules on job details page while running. The way thumbnail reloads are triggered needs clarification. The update algorithm might also miss test modules completely (e.g. the `isosize` module is often not updated correctly).
* When the worker cache is restarted while an openQA worker instance waits for assets, the worker instance will be stuck until restarted. The worker should not be stuck, e.g. it should abort with a timeout or be informed that the cache service is terminating
* openQA became very good at restarting all tests correctly when a worker machine goes down. How about we try to suspend tests and VMs on the workers when shutting down and try to resume after reboot? If resume fails then just start over, i.e. in the currently safe way of "abandoning" the job which should be retriggered automatically. Similar to what Jenkins does on shutdown: not accept new jobs but still finish old ones.
* custom install mentions web proxy script but does not pull in apache2 as dependency or does not report with a helpful message
* nightly CI job that upgrades dependencies additionally from CPAN and runs openQA tests #65975
* Ensure jobs are regarded as passed if all modules passed but time is exceeded #65657
* Sometimes tests are unstable and fail and it is not easy to fix or test maintainers just do not know how to do that. Add a RETRY option for individual test modules as well as complete scenarios
* Save space in openqa by keeping dot of job results in database for longer but delete everything on filesystem (maybe even move database content to slow different database); quota for results and logs to show that important builds take up more space #66922#note-1
* Add support for SSH from Host to SUT VM, e.g. using qemu parameters like `-net user,hostfwd=tcp::7777-:8001` and also start an ssh server within SUT and unblock in firewall #31417
* Calculate/check checksum of generated/used assets, especially fixed assets, at least show in vars.json so that "investigation" shows any difference #67288
* Generic bugref in comments. Find a generic way to define bugrefs without having the individual issue tracker URL specified within the openQA source code, e.g. with `bug:<url>` #38432
* Include failed tests with closed issues assigned in another level of the "TODO test overview" as these tests are relevant for reviewers. Also please keep in mind that in many cases a label linking a "closed" issue is actually still valid, e.g. a "RESOLVED FIXED" bug but the bugfix has not reached the product. #48998
* filter and group by product issue / test issue, e.g. in build overview bar to distinguish "failed due to product issue" from "failed due to test issue"
* Add support to add manual users that are not managed over external identity provider, e.g. dedicated "users" for workers over API, optionally API #6558
* openqa-scheduler should not stay "running" but inoperative on `Mojo::Reactor::Poll: Timer failed: Can't open database lock file /var/lib/openqa/db/db.lock! at /usr/share/openqa/script/../lib/OpenQA/Schema.pm line 87.` -> #69523
* Improve displaying serial results (further ideas mentioned in #70648)
  * If the test (soft)failed, collapse green test modules (except for a linked #anchor), those aren't the main point of interest anyway
  * Treat successive terminal output/input more like an actual terminal and use an inline line break instead of separate boxes
  * If the output matched, only show the output (with leading and trailing newlines removed) and not the "# wait_serial expected..." and "result:" lines
  * If the space is too narrow for the "Report bug" buttons, hide them until the relevant step is clicked or hovered
  * Allow expanding all serial results (within a test module) at once
  * I'd keep the boxes as quick links into the single long terminal view. When you click the box, it'll open the terminal view and jump to the location of the corresponding output.
  * I'd much prefer to get rid of the # wait_serial expected: blahblahblah as much as possible. (See #70648#note-7 for details.)
* Improve asset upload performance from worker to webui, so optimisations like the [local upload feature](https://github.com/os-autoinst/openQA/pull/3325) are not necessary anymore
* (Semi-)automatic cleanup of old needles: In very active needle repos outdated needles should be deleted from time to time when they have not been matched or used for a long time. We have a UI for that but we could improve with automation. Also see https://chat.suse.de/channel/testing?msg=rYLtgxCr4a7GeKTsh and following messages
* Generate I/O salt monitoring and alert panels dynamically, see #70834
* cache service
  * Prevent purging of assets before the worker has the chance to hardlink the asset. (See #71827 and #71827#note-22 in particular.)
* As discussed in SUSE QE Tools weekly 2020-11-20 we encountered the problem that some team members were never subscribed to certain projects, mailing lists, etc. We should look for a single source of truth for the team roster and then at best automatically derive everything else from that.
* Add configuration option to give "operator" or "admin" permissions to user by default, e.g. for trusted instances or public instances that are meant to be used for personal use cases
* Improve logs analysis experience:
  * Colorize the logs in "Logs & Assets".
  * Add a filter option to show only the lines that match one criterion (e.g. criticality)
  * Add a time range selector to show only the lines within the selected range
  * Add a filter to search text by field (component, message, ...)
  * NOTE: Take [lnav](https://lnav.org/) into consideration
* Monitoring check based on a passive performance measurement regarding throughput on network interfaces (#78127) e.g. grafana alerts
* Try to post auto-review comments as "auto-review", not "geekotest", with corresponding user account keys etc., e.g. check what user "geekotest" is doing, either just use a new api-key and secret or sudo? (#77899)
* Show results from openQA investigation jobs directly on /tests/overview (based on discussion between okurz and dimstar in https://matrix.to/#/!ZzdoIxkFyLxnxddLqM:matrix.org/$1612021254698548YsMQD:matrix.org?via=matrix.org : "my usual workflow in the morning is the todo view, looking at the build fails/modules. maybe if the investigation job info would be visible there directly instead of being behind a few clicks. note that i do this initial task usually on the phone screen before moving to the office. i would rather think of 'status result symbols'. interesting would be if the rerun failed the same way or different or success. in case of same fail, i could likely skip the manual rerun, different or success would still make my rerun interesting")
* Warn if sum of configured quotas on https://openqa.opensuse.org/admin/assets (excluding "Untracked") exceeds available space on filesystem
* Execute tests in containers instead of VMs
  * Provide [proof of readiness](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) with our container definitions
  * Provide API to setup, spawn and inspect containers (currently has to be done via CLI)
* Similar to job details, reflect the tests overview summary in the favicon, e.g. same color as the summary box, red if any failed, and so on
* Provide links to JSON representations of pages directly on the page. Similar to what redmine does in the footer of pages: "Also available in: Atom/PDF/JSON/XML"
* Consider using https://prettier.io/ for files like .json, .css, .yaml, .html
* openqa scheduling with clone job or jobs post uses _GROUP=0 to schedule a job outside any group, with isos post the parameter is used to limit the matching to a group
* in https://progress.opensuse.org/issues/67006 the problem was that there are too many needles so that the check takes very long. We could add an option to `check_or_assert_screen` to make that fatal for time-critical checks, e.g. grub2 with timeout
* email notifications for done jobs -> specify list of notification recipients in tests and send notification
* run kubernetes on internal cloud and runners on there, especially non-qemu ones, to free workers to run qemu and move to o3
* feedback from SUSE QE Tools workshop 2021-03-19: asmorodskyi: limitation in multi-machine tests: TAP devices need to be handled outside the tests and be in line with tests. So the setup is difficult and error prone. Would be nice to have API calls to create/manage TAP devices or automatically create TAP devices based on test variables that specify the requirements
* Create openQA and/or os-autoinst tutorial using https://cloud.google.com/shell/docs/cloud-shell-tutorials/tutorials
* Rework ambiguous term "Online" in /admin/workers , see https://chat.suse.de/channel/testing?msg=Dizwgosx8P3QbcPiS for an incident of confusion about the meaning of "Online". Currently "Online" means "Idle" as in reactive to requests but not currently working on jobs. Proposals: Rename "Online" to "Idle" and define a new "Not-Offline" and soon rename that to "Online". Or define a new aggregated state like "[Online, Working]" or "[Idle, Working]"
* Similar to `__ONLY_OBSOLETE_SAME_BUILD` we could have `__ONLY_OBSOLETE_OTHER_BUILDS`, see #90131
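
The rudimentary idea from the "Add automated check for isotovideo::INTERFACE" item above (#36250) could be sketched roughly as follows. This is only an illustration in Python, not existing os-autoinst code: the function names, the regex used to find `sub` declarations and the convention of skipping subs starting with `_` are all assumptions. It also compares the full set of subs rather than only counting additions, so removals and renames are flagged as well.

```python
import re

# Assumption: a Perl sub declaration starts a line with "sub <name>".
SUB_PATTERN = re.compile(r"^\s*sub\s+([A-Za-z_]\w*)", re.MULTILINE)


def public_subs(perl_source):
    """Return the names of subs declared in the given Perl source text.

    Subs starting with an underscore are treated as private and skipped
    (an assumed convention for this sketch).
    """
    return {
        name
        for name in SUB_PATTERN.findall(perl_source)
        if not name.startswith("_")
    }


def interface_bump_missing(old_source, new_source, old_version, new_version):
    """Return True if the set of public subs changed but the version did not."""
    subs_changed = public_subs(old_source) != public_subs(new_source)
    return subs_changed and old_version == new_version
```

A CI test could feed this the previous and current contents of the two modules together with the previous and current interface version, and fail when `interface_bump_missing` returns `True`.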