action #93050
closed - Proposal: Use openqaworker11 and openqaworker12 as normal workers and only pull them out of production when necessary
Added by okurz over 3 years ago. Updated over 3 years ago.
Description
Motivation
openqaworker11 and openqaworker12 are rarely used. We could make use of the additional capacity within our production infrastructure and still easily take the machines out of production when needed, as we now have a good process to do so and can automatically retrigger and reassign any openQA jobs when we switch them off or take them out of production.
Acceptance criteria
- AC1: Our wiki explains how to use the machines as "staging" workers for development or testing and how to bring them back into production
- AC2: The machines are actively used in the production setup if not manually disabled
Suggestions
- DONE: Ask the team -> no objections (after updating ACs)
- Update salt pillar config accordingly
- Apply full salt state and check for success
- Crosscheck monitoring results
- Teach the team how to use the machines as "staging" workers for development or testing
Updated by okurz over 3 years ago
- Status changed from Workable to In Progress
- Assignee set to okurz
Updated by kraih over 3 years ago
If we do this, it should also be documented somewhere how to remove the machines from production into a staging setup, and how to return them when done with testing. Not just because I will definitely forget how to do it, but also for new team members.
Updated by kraih over 3 years ago
Maybe worth mentioning that the staging machines tend to be left with unstable test packages installed. Dealing with that will also need a process.
Updated by okurz over 3 years ago
- Description updated (diff)
kraih wrote:
If we do this, it should also be documented somewhere how to remove the machines from production into a staging setup, and how to return them when done with testing. Not just because I will definitely forget how to do it, but also for new team members.
Absolutely, thanks. I extended AC1 to cover "how to bring them back"
kraih wrote:
Maybe worth mentioning that the staging machines tend to be left with unstable test packages installed. Dealing with that will also need a process.
good point. Our production setup with salt should ensure that the machines are kept updated and cleaned up. I can think of a simple extension of our salt rules to make sure packages come from official repos, see the sketch below. I don't think anyone needs outdated and unstable test packages as a baseline for testing purposes, right?
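A minimal sketch of what such a check could look like on a worker, assuming plain zypper is used (the actual salt rule would still need to be written):

# list installed packages that do not belong to any configured repository,
# e.g. leftover test builds from a staging session
zypper packages --orphaned
# bring the system back in line with the configured repositories
zypper -n dup --allow-vendor-change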
Updated by openqa_review over 3 years ago
- Due date set to 2021-06-11
Setting due date based on mean cycle time of SUSE QE Tools
Updated by livdywan over 3 years ago
Do we have instructions now on how to move workers between staging and production? I see no mention of that here.
Updated by mkittler over 3 years ago
I also think having such instructions figured out should be part of the ticket.
Updated by okurz over 3 years ago
mkittler wrote:
I also think having such instructions figured out should be part of the ticket.
yes, this is why we have "AC1: Our wiki explains how to use the machines as "staging" workers for development or testing and how to bring them back into production" or did you mean something else?
I am roughly thinking of the following:
Use staging machines for manual testing
ssh osd "sudo salt-key -y -d $hostname"
ssh $hostname "sudo systemctl disable --now telegraf openqa-worker-auto-restart@\*"
Bring back staging machines into production
ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"
Updated by okurz over 3 years ago
- Description updated (diff)
- Due date deleted (2021-06-11)
So far the feedback from the team has been positive, so we can try. I might be able to continue with this low-priority task myself eventually or give it back to the backlog for others to continue.
Updated by okurz over 3 years ago
I added https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Take-machines-out-of-salt-controlled-production-eg-for-investigation-or-development . openqaworker12 was unable to boot. I manually disabled the incorrect entries in /etc/fstab, brought the machine back manually, which caused alerts, informed the team and stopped the broken openQA worker instances. openqaworker12 could not connect to the openQA webUI due to an invalidated or expired API key. I updated the key from my user account on openqa-staging-1.qa.suse.de.
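For reference, the API key the worker uses to connect lives in /etc/openqa/client.conf on the worker host; a minimal sketch with placeholder values (not the real credentials):

# /etc/openqa/client.conf on the worker (placeholder values)
[openqa.suse.de]
key = REPLACE_WITH_API_KEY
secret = REPLACE_WITH_API_SECRET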
Created an MR for bringing the openqaworker11 and openqaworker12 configuration into the salt pillar, including the config for OSD access:
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/325
Updated by okurz over 3 years ago
This is weird:
# zypper -n in openQA-worker
Loading repository data...
Reading installed packages...
'openQA-worker' is already installed.
No update candidate for 'openQA-worker-4.6.1624461400.d2e48c03e-lp152.4096.1.noarch'. The highest available version is already installed.
Resolving package dependencies...
Nothing to do.
openqaworker12:/home/okurz # rpm -q openQA-worker
package openQA-worker is not installed
Updated by okurz over 3 years ago
- Status changed from Feedback to In Progress
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/325 merged.
Manually ensured that a proper high state is applied on openqaworker12. Encountered another package, "ntp", that was reported as installed but actually not installed. Seems like rpm --rebuilddb helped here. Did another fix https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/326 and applied a clean high state. Now actual openQA jobs are picked up.
Many jobs have already passed. Also multi-machine jobs work, e.g. https://openqa.suse.de/tests/6332863
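For the record, roughly the repair sequence that helped here (the forced reinstall is an assumption on top of the rpm --rebuilddb mentioned above):

# rebuild the rpm database so rpm and zypper agree again
rpm --rebuilddb
# force a reinstall so the package really ends up in the rpm database
zypper -n in -f openQA-worker
# verify
rpm -q openQA-worker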
Updated by okurz over 3 years ago
- Copied to action #94765: Bring openqaworker12 into production (w/o multi-machine test support) size:M added
Updated by okurz over 3 years ago
https://openqa.suse.de/tests/6338980#step/suseconnect_scc/73 on openqaworker12 looks like the network within multi-machine jobs does not work properly. On openqaworker12, ovs-vsctl show shows:
options: {remote_ip="10.160.0.227"}
Port gre6
Interface gre6
type: gre
options: {remote_ip="10.160.2.20"}
error: "could not add network device gre6 to ofproto (File exists)"
Maybe that's the problem then. Calling sudo salt -l error --state-output=changes -C 'G@roles:worker' cmd.run 'ovs-vsctl show | grep -C 3 error' shows that this problem does not appear elsewhere. There was also an error about openqaworker12: 'cmd.run' is not available.
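A possible way to get rid of such a stale GRE port would be to remove it from the Open vSwitch bridge and re-apply the salt high state so the tunnels are recreated consistently; a sketch, assuming the multi-machine bridge is called br1 as in our usual setup:

# on openqaworker12: remove the conflicting port (bridge name br1 is an assumption)
ovs-vsctl del-port br1 gre6
# from osd: re-apply the high state to recreate the tunnel configuration
sudo salt --state-output=changes 'openqaworker12*' state.apply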
I disabled openQA operation on openqaworker12 with systemctl mask --now openqa-worker-cacheservice and did systemctl disable --now telegraf to prevent any false alerts while the system is intentionally disabled. Called WORKER=openqaworker12 failed_since="2021-06-25" openqa-restart-incompletes-on-worker-instance to restart all incompletes on that worker. To cover all failed jobs as well I did https://github.com/os-autoinst/scripts/pull/86 and called host=openqa.suse.de WORKER=openqaworker12 failed_since="2021-06-25" result=failed bash -ex ./openqa-advanced-retrigger-jobs
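For anyone repeating this, the retrigger helper comes from the os-autoinst/scripts repository and can be fetched like this (sketch):

# fetch the helper scripts used above
git clone https://github.com/os-autoinst/scripts.git
cd scripts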
Created #94765 to cover openqaworker12. Let me try openqaworker11 then. Cleaned up some duplicate entries in /etc/zypp/repos.d, changed /etc/salt/minion to point to openqa.suse.de, restarted salt-minion, found in /var/log/salt/minion that the salt master pki key needed to be deleted, deleted it, restarted salt-minion again and applied the high state from osd with sudo salt -l error --state-output=changes -C 'openqaworker11*' state.apply. This created a faulty /etc/salt/repos.d/devel_openQA.repo with just a single line, keeppackages=1:
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/517
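For comparison, a complete repository definition would be expected to look roughly like this (the exact path and baseurl are assumptions based on the Leap 15.2 workers):

# devel_openQA.repo as zypper would expect it (path and baseurl assumed)
[devel_openQA]
name=devel:openQA
enabled=1
autorefresh=1
baseurl=https://download.opensuse.org/repositories/devel:/openQA/openSUSE_Leap_15.2/
keeppackages=1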
Unfortunately openQA jobs immediately started to be processed on openqaworker11 but packages were not up-to-date, causing incompletes like https://openqa.suse.de/tests/6344536 due to an out-of-date os-autoinst. Calling zypper dup helped. Specific verification job: https://openqa.suse.de/tests/6345082
ovs-vsctl show reveals that openqaworker11 has the same problem as openqaworker12, see #94765. So I created a new ticket for openqaworker11 as well: #94783
Updated by okurz over 3 years ago
- Status changed from In Progress to Resolved
We accept the proposal. Both machines have been shown to work for single-machine tests but not yet for multi-machine tests. We have specific tickets for both machines.