action #93050

Proposal: Use openqaworker11 and openqaworker12 as normal workers and only pull out from production when necessary

Added by okurz 5 months ago. Updated 4 months ago.

Status: Resolved
Priority: Low
Assignee:
Target version:
Start date: 2021-05-24
Due date:
% Done: 0%
Estimated time:

Description

Motivation

openqaworker11 and openqaworker12 are rarely used. We could use their additional capacity within our production infrastructure and still take the machines out of production easily when needed: we now have a good process for that, and any openQA jobs are automatically retriggered and reassigned when we switch the machines off or take them out of production.

Acceptance criteria

  • AC1: Our wiki explains how to use the machines as "staging" workers for development or testing and how to bring them back into production
  • AC2: The machines are actively used in the production setup if not manually disabled

Suggestions

  • DONE: Ask the team -> no objections (after updating ACs)
  • Update salt pillar config accordingly
  • Apply full salt state and check for success
  • Crosscheck monitoring results
  • Teach the team how to use the machines as "staging" workers for development or testing

Related issues

Copied to openQA Infrastructure - action #94765: Bring openqaworker12 into production (w/o multi-machine test support) size:M (Workable)

History

#1 Updated by okurz 5 months ago

  • Status changed from Workable to In Progress
  • Assignee set to okurz

#2 Updated by kraih 5 months ago

If we do this, it should also be documented somewhere how to move the machines from production into a staging setup, and how to return them when done with testing. Not just because I will definitely forget how to do it, but also for new team members.

#3 Updated by kraih 5 months ago

Maybe worth mentioning that the staging machines tend to be left with unstable test packages installed. Dealing with that will also need a process.

#4 Updated by okurz 5 months ago

  • Description updated (diff)

kraih wrote:

If we do this, it should also be documented somewhere how to move the machines from production into a staging setup, and how to return them when done with testing. Not just because I will definitely forget how to do it, but also for new team members.

Absolutely, thanks. I extended AC1 to cover "how to bring them back".

kraih wrote:

Maybe worth mentioning that the staging machines tend to be left with unstable test packages installed. Dealing with that will also need a process.

Good point. Our production setup with salt should ensure that the setup is updated and cleaned. I can think of a simple extension of our salt rules to make sure packages come from official repos. I don't think anyone needs outdated and unstable test packages as a baseline for testing purposes, right?

#5 Updated by openqa_review 5 months ago

  • Due date set to 2021-06-11

Setting due date based on mean cycle time of SUSE QE Tools

#6 Updated by okurz 5 months ago

  • Status changed from In Progress to Feedback

#7 Updated by cdywan 5 months ago

Do we have instructions now on how to move workers between staging and production? I see no mention of that here.

#8 Updated by mkittler 5 months ago

I also think having such instructions figured out should be part of the ticket.

#9 Updated by okurz 5 months ago

mkittler wrote:

I also think having such instructions figured out should be part of the ticket.

Yes, this is why we have "AC1: Our wiki explains how to use the machines as "staging" workers for development or testing and how to bring them back into production". Or did you mean something else?

I am roughly thinking of the following:

Use staging machines for manual testing

ssh osd "sudo salt-key -y -d $hostname"
ssh $hostname "sudo systemctl disable --now telegraf openqa-worker-auto-restart@\*"

Bring back staging machines into production

ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"
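
The two procedures above could be wrapped in a small helper. This is only a sketch under assumptions: the function names staging_out and staging_in and the DRY_RUN switch are hypothetical and not part of any existing openQA tooling, and the key is assumed to be accepted again with salt-key -y -a. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helpers; the names and DRY_RUN are illustrative only.
# DRY_RUN=1 (default) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        # Printed without the original quoting, for inspection only.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Take a worker out of salt-controlled production for manual testing:
# delete its key on the salt master, then stop monitoring and workers.
staging_out() {
    host=$1
    run ssh osd "sudo salt-key -y -d $host"
    run ssh "$host" "sudo systemctl disable --now telegraf openqa-worker-auto-restart@\*"
}

# Bring the worker back: accept its key again (assumed: salt-key -y -a)
# and apply the full salt state.
staging_in() {
    host=$1
    run ssh osd "sudo salt-key -y -a $host && sudo salt --state-output=changes $host state.apply"
}
```

Calling staging_out openqaworker11 with the default DRY_RUN=1 shows the two commands for review; setting DRY_RUN=0 would execute them.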

#10 Updated by okurz 5 months ago

  • Priority changed from Normal to Low

#11 Updated by okurz 4 months ago

  • Description updated (diff)
  • Due date deleted (2021-06-11)

So far feedback from the team has been positive, so we can try. I might be able to continue with this low-priority task myself eventually, or give it back to the backlog for others to continue.

#12 Updated by okurz 4 months ago

I added https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Take-machines-out-of-salt-controlled-production-eg-for-investigation-or-development . openqaworker12 was unable to boot. I manually disabled the incorrect entries in /etc/fstab and brought the machine back manually. This caused alerts, so I informed the team and stopped the broken openQA worker instances. openqaworker12 could not connect to the openQA webUI due to an invalidated or expired key. Updated the key from my user account on openqa-staging-1.qa.suse.de.

Created MR for bringing openqaworker11 and openqaworker12 configuration into salt pillar that includes config for OSD access:
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/325

#13 Updated by okurz 4 months ago

This is weird:

# zypper -n in openQA-worker
Loading repository data...
Reading installed packages...
'openQA-worker' is already installed.
No update candidate for 'openQA-worker-4.6.1624461400.d2e48c03e-lp152.4096.1.noarch'. The highest available version is already installed.
Resolving package dependencies...
Nothing to do.
openqaworker12:/home/okurz # rpm -q openQA-worker
package openQA-worker is not installed

#14 Updated by okurz 4 months ago

  • Status changed from Feedback to In Progress

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/325 merged.

Manually ensured that a proper high state is applied on openqaworker12. Encountered another package, "ntp", that was reported as installed but was not actually installed. It seems rpm --rebuilddb helped here. Did another fix, https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/326 , and applied a clean high state. Now actual openQA jobs are picked up.
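
The package inconsistency described above (zypper treats a package as installed while rpm -q does not know it) can be checked for a list of packages before applying the high state. A minimal sketch, assuming that a non-zero exit status of rpm -q is a sufficient signal; the function name missing_in_rpmdb and the RPM override variable are hypothetical.

```shell
# Hypothetical check; RPM can be overridden, e.g. for testing.
RPM=${RPM:-rpm}

# Print every given package that "rpm -q" does not report as installed.
missing_in_rpmdb() {
    for pkg in "$@"; do
        "$RPM" -q "$pkg" >/dev/null 2>&1 || printf '%s\n' "$pkg"
    done
}
```

For example, missing_in_rpmdb openQA-worker ntp prints the packages unknown to rpm. If packages are listed that zypper considers installed, rebuilding the database with rpm --rebuilddb may be worth trying, as it seemed to help here.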

Many jobs have already passed. Also multi-machine jobs work, e.g. https://openqa.suse.de/tests/6332863

#15 Updated by okurz 4 months ago

  • Copied to action #94765: Bring openqaworker12 into production (w/o multi-machine test support) size:M added

#16 Updated by okurz 4 months ago

https://openqa.suse.de/tests/6338980#step/suseconnect_scc/73 on openqaworker12 suggests that networking within multi-machine jobs does not work properly. On openqaworker12, ovs-vsctl show shows:

                options: {remote_ip="10.160.0.227"}
        Port gre6
            Interface gre6
                type: gre
                options: {remote_ip="10.160.2.20"}
                error: "could not add network device gre6 to ofproto (File exists)"

Maybe that is the problem. Calling sudo salt -l error --state-output=changes -C 'G@roles:worker' cmd.run 'ovs-vsctl show | grep -C 3 error' shows that this problem does not appear on any other worker. The output also contains an error for openqaworker12: 'cmd.run' is not available.

I disabled openQA operation on openqaworker12 with systemctl mask --now openqa-worker-cacheservice and ran systemctl disable --now telegraf to prevent false alerts while the system is intentionally disabled. I then called WORKER=openqaworker12 failed_since="2021-06-25" openqa-restart-incompletes-on-worker-instance to restart all incomplete jobs on that worker. To cover all failed jobs as well, I created https://github.com/os-autoinst/scripts/pull/86 and called host=openqa.suse.de WORKER=openqaworker12 failed_since="2021-06-25" result=failed bash -ex ./openqa-advanced-retrigger-jobs
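
Instead of grepping the ovs-vsctl show output with context lines as described above, the affected ports can be listed directly. A minimal sketch, assuming the output layout shown in this comment (a Port line followed by indented Interface, type, options and optionally error lines); the function name ovs_error_ports is hypothetical.

```shell
# Read "ovs-vsctl show" output on stdin and print "<port>: <error text>"
# for every port whose interface reports an error.
ovs_error_ports() {
    awk '
        # Remember the most recent port name.
        $1 == "Port"   { port = $2 }
        # Strip the "error:" prefix and report it with the port name.
        $1 == "error:" { sub(/^[ \t]*error:[ \t]*/, ""); print port ": " $0 }
    '
}
```

On a worker this would be used as sudo ovs-vsctl show | ovs_error_ports; an empty result means no interface reported an error.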

Created #94765 to cover openqaworker12. Let me try openqaworker11 then. I cleaned up some duplicate entries in /etc/zypp/repos.d, changed /etc/salt/minion to point to openqa.suse.de and restarted salt-minion. /var/log/salt/minion showed that the salt master pki key needed to be deleted, so I deleted it and restarted salt-minion again. Then I applied the high state from osd with sudo salt -l error --state-output=changes -C 'openqaworker11*' state.apply. This creates a faulty /etc/salt/repos.d/devel_openQA.repo with just a single line, keeppackages=1

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/517

Unfortunately, openQA jobs immediately started to be processed on openqaworker11, but the packages were not up to date, causing incompletes like https://openqa.suse.de/tests/6344536 due to an out-of-date os-autoinst. Calling zypper dup helped. Specific verification job: https://openqa.suse.de/tests/6345082#

ovs-vsctl show reveals that openqaworker11 has the same problem as openqaworker12, see #94765. So I created a new ticket for openqaworker11 as well: #94783

#17 Updated by okurz 4 months ago

  • Status changed from In Progress to Resolved

We accept the proposal. Both machines have been shown to work for single-machine tests but not yet for multi-machine tests. We have specific tickets for both machines.
