action #93050: Proposal: Use openqaworker11 and openqaworker12 as normal workers and only pull out from production when necessary - openQA Infrastructure (public) - openSUSE Project Management Tool

action #93050

closed

Proposal: Use openqaworker11 and openqaworker12 as normal workers and only pull out from production when necessary

Added by okurz almost 4 years ago. Updated over 3 years ago.

Status:

Resolved

Priority:

Low

Assignee:

okurz

Category:

Target version:

openQA Project (public) - Ready

Start date:

2021-05-24

Due date:

% Done:

Estimated time:

Description

Motivation¶

openqaworker11 and openqaworker12 are rarely used. We could make use of the additional capacity within our production infrastructure and still easily take the machines out of production when needed. We now have a good process to do so, and openQA jobs are automatically retriggered and reassigned when we switch the machines off or take them out of production.

Acceptance criteria¶

AC1: Our wiki explains how to use the machines as "staging" workers for development or testing and how to bring them back into production
AC2: The machines are actively used in the production setup if not manually disabled

Suggestions¶

DONE: Ask the team -> no objections (after updating ACs)
Update salt pillar config accordingly
Apply full salt state and check for success
Crosscheck monitoring results
Teach the team how to use the machines as "staging" workers for development or testing

Related issues 1 (0 open — 1 closed)

Updated by okurz almost 4 years ago

Status changed from Workable to In Progress
Assignee set to okurz

Updated by kraih almost 4 years ago

If we do this, it should also be documented somewhere how to move the machines from production into a staging setup, and how to return them when done with testing. Not just because I will definitely forget how to do it, but also for new team members.

Updated by kraih almost 4 years ago

Maybe worth mentioning that the staging machines tend to be left with unstable test packages installed. Dealing with that will also need a process.

Updated by okurz almost 4 years ago

Description updated (diff)

kraih wrote:

If we do this, it should also be documented somewhere how to move the machines from production into a staging setup, and how to return them when done with testing. Not just because I will definitely forget how to do it, but also for new team members.

absolutely. Thanks. I extended AC1 to cover "how to bring them back"

kraih wrote:

Maybe worth mentioning that the staging machines tend to be left with unstable test packages installed. Dealing with that will also need a process.

Good point. Our production setup with salt should ensure that the setup is updated and cleaned. I can think of a simple extension of our salt rules to make sure packages are from official repos. I don't think anyone needs outdated and unstable test packages as a baseline for testing purposes, right?

Updated by openqa_review almost 4 years ago

Due date set to 2021-06-11

Setting due date based on mean cycle time of SUSE QE Tools

Updated by okurz almost 4 years ago

Status changed from In Progress to Feedback

Updated by livdywan almost 4 years ago

Do we have instructions now on how to move workers between staging and production? I see no mention of that here.

Updated by mkittler almost 4 years ago

I also think having such instructions figured out should be part of the ticket.

Updated by okurz almost 4 years ago

mkittler wrote:

I also think having such instructions figured out should be part of the ticket.

yes, this is why we have "AC1: Our wiki explains how to use the machines as "staging" workers for development or testing and how to bring them back into production" or did you mean something else?

I am roughly thinking of the following:

Use staging machines for manual testing¶

ssh osd "sudo salt-key -y -d $hostname"
ssh $hostname "sudo systemctl disable --now telegraf openqa-worker-auto-restart@\*"

Bring back staging machines into production¶

ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"
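
The two procedures above could be wrapped in a small helper so that nobody has to remember the exact commands. This is only a sketch: the function names `staging_take_out` and `staging_bring_back` and the `RUN` switch are hypothetical and not part of any existing tooling. It assumes that bringing a machine back means accepting its key with `salt-key -y -a`. By default the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch only: prints the commands unless RUN=1 is set in the environment.
# Function names and the RUN switch are illustrative assumptions.

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Remove the worker key from the salt master, then stop monitoring and workers.
staging_take_out() {
    hostname=$1
    run ssh osd "sudo salt-key -y -d $hostname"
    run ssh "$hostname" "sudo systemctl disable --now telegraf openqa-worker-auto-restart@\*"
}

# Accept the worker key again and apply the full salt state.
staging_bring_back() {
    hostname=$1
    run ssh osd "sudo salt-key -y -a $hostname && sudo salt --state-output=changes $hostname state.apply"
}
```

Calling `staging_take_out openqaworker11` prints the two commands for review. Setting `RUN=1` would actually execute them, so that should only be done after checking the printed output.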

#10

Updated by okurz almost 4 years ago

Priority changed from Normal to Low

#11

Updated by okurz over 3 years ago

Description updated (diff)
Due date deleted (~~2021-06-11~~)

So far feedback from the team has been positive, so we can try. I might be able to continue with this low-priority task myself eventually, or give it back to the backlog for others to continue.

#12

Updated by okurz over 3 years ago

I added https://progress.opensuse.org/projects/openqav3/wiki/Wiki#Take-machines-out-of-salt-controlled-production-eg-for-investigation-or-development . openqaworker12 was unable to boot. I manually disabled the incorrect entries in /etc/fstab and brought the machine back up manually. This caused alerts, so I informed the team and stopped the broken openQA worker instances. openqaworker12 also could not connect to the openQA webUI due to an invalidated or expired key. I updated the key from my user account on openqa-staging-1.qa.suse.de.

Created MR for bringing openqaworker11 and openqaworker12 configuration into salt pillar that includes config for OSD access:
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/325

#13

Updated by okurz over 3 years ago

This is weird:

# zypper -n in openQA-worker
Loading repository data...
Reading installed packages...
'openQA-worker' is already installed.
No update candidate for 'openQA-worker-4.6.1624461400.d2e48c03e-lp152.4096.1.noarch'. The highest available version is already installed.
Resolving package dependencies...
Nothing to do.
openqaworker12:/home/okurz # rpm -q openQA-worker
package openQA-worker is not installed
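
A disagreement like the one above, where zypper considers a package installed and rpm does not, could be detected with a small cross-check. The following is a minimal sketch, not an established procedure: the function name `pkg_db_consistent` is hypothetical, and it assumes that `zypper search --installed-only --match-exact` exits non-zero when nothing matches.

```shell
#!/bin/sh
# Sketch: report whether rpm and zypper agree that a package is installed.
# The function name is illustrative. Assumes `zypper search` exits non-zero
# when no installed package matches.

pkg_db_consistent() {
    pkg=$1
    if rpm -q "$pkg" >/dev/null 2>&1; then
        rpm_state=installed
    else
        rpm_state=missing
    fi
    if zypper --non-interactive search --installed-only --match-exact "$pkg" >/dev/null 2>&1; then
        zypper_state=installed
    else
        zypper_state=missing
    fi
    if [ "$rpm_state" = "$zypper_state" ]; then
        echo "$pkg: consistent ($rpm_state)"
    else
        echo "$pkg: MISMATCH rpm=$rpm_state zypper=$zypper_state"
        return 1
    fi
}
```

Running `pkg_db_consistent openQA-worker` on the affected machine would be expected to report a mismatch in the situation shown above. A non-zero return value makes the check usable in scripts.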

#14

Updated by okurz over 3 years ago

Status changed from Feedback to In Progress

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/325 merged.

Manually ensured that a proper high state is applied on openqaworker12. Encountered another package, "ntp", that is reported as installed but then turns out not to be installed. It seems that rpm --rebuilddb helped here. Added another fix, https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/326, and applied a clean high state. Now actual openQA jobs are picked up.

Many jobs have already passed. Also multi-machine jobs work, e.g. https://openqa.suse.de/tests/6332863

#15

Updated by okurz over 3 years ago

Copied to action #94765: Bring openqaworker12 into production (w/o multi-machine test support) size:M added

#16

Updated by okurz over 3 years ago

https://openqa.suse.de/tests/6338980#step/suseconnect_scc/73 on openqaworker12 suggests that networking within multi-machine jobs does not work properly. On openqaworker12, ovs-vsctl show shows:

                options: {remote_ip="10.160.0.227"}
        Port gre6
            Interface gre6
                type: gre
                options: {remote_ip="10.160.2.20"}
                error: "could not add network device gre6 to ofproto (File exists)"

Maybe that is the problem. Calling sudo salt -l error --state-output=changes -C 'G@roles:worker' cmd.run 'ovs-vsctl show | grep -C 3 error' shows that this problem does not appear on any other worker. The output also contains the error "openqaworker12: 'cmd.run' is not available". I disabled openQA operation on openqaworker12 with systemctl mask --now openqa-worker-cacheservice and ran systemctl disable --now telegraf to prevent false alerts while the system is intentionally disabled. I called WORKER=openqaworker12 failed_since="2021-06-25" openqa-restart-incompletes-on-worker-instance to restart all incomplete jobs on that worker. To cover all failed jobs as well, I created https://github.com/os-autoinst/scripts/pull/86 and called host=openqa.suse.de WORKER=openqaworker12 failed_since="2021-06-25" result=failed bash -ex ./openqa-advanced-retrigger-jobs
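
Instead of grepping with context lines, the affected ports can be extracted directly from the ovs-vsctl show output. This is a rough sketch that assumes the output layout shown above, where an error line follows the Port line it belongs to. The function name `ovs_ports_with_errors` is made up for illustration.

```shell
#!/bin/sh
# Sketch: list Open vSwitch ports whose entry in `ovs-vsctl show` has an error.
# Reads `ovs-vsctl show` output on stdin. The function name is illustrative.

ovs_ports_with_errors() {
    awk '
        # Remember the most recent port name and strip optional quotes.
        $1 == "Port"   { port = $2; gsub(/"/, "", port) }
        # An error line belongs to the port seen last.
        $1 == "error:" { print port }
    '
}
```

On a worker this could be used as `sudo ovs-vsctl show | ovs_ports_with_errors`. An empty result means no port reported an error.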

Created #94765 to cover openqaworker12. Let me try openqaworker11 then. I cleaned up some duplicate entries in /etc/zypp/repos.d, changed /etc/salt/minion to point to openqa.suse.de and restarted salt-minion. /var/log/salt/minion showed that the salt master pki key needs to be deleted, so I deleted it and restarted salt-minion again. Then I applied the high state from osd with sudo salt -l error --state-output=changes -C 'openqaworker11*' state.apply. This creates a faulty /etc/salt/repos.d/devel_openQA.repo with just a single line keeppackages=1

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/517
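
The steps for re-pointing the salt minion described above could look roughly like the following sketch. Several details are assumptions and not taken from this ticket: that the master is configured with a `master:` line in /etc/salt/minion, and that the cached master key is stored in /etc/salt/pki/minion/minion_master.pub (the salt default location). The function name and the `RUN` switch are hypothetical. By default the commands are only printed.

```shell
#!/bin/sh
# Sketch of the re-pointing steps. Prints commands unless RUN=1 is set.
# Assumptions: a `master:` line in /etc/salt/minion selects the master and
# the cached master key is /etc/salt/pki/minion/minion_master.pub.
# Verify both on the actual machine before running anything.

run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

repoint_salt_minion() {
    master=$1
    # Point the minion at the new master, replacing a commented or active line.
    run sudo sed -i "s/^#\{0,1\}master:.*/master: $master/" /etc/salt/minion
    # Drop the cached key of the previous master so the new one is accepted.
    run sudo rm -f /etc/salt/pki/minion/minion_master.pub
    run sudo systemctl restart salt-minion
}
```

`repoint_salt_minion openqa.suse.de` prints the three commands for review. The high state would still be applied from osd afterwards, as described above.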

Unfortunately, openQA jobs immediately started to be processed on openqaworker11 while packages were not yet up-to-date, causing incompletes like https://openqa.suse.de/tests/6344536 due to an out-of-date os-autoinst. Calling zypper dup helped. Specific verification job: https://openqa.suse.de/tests/6345082#

ovs-vsctl show reveals that openqaworker11 has the same problem as openqaworker12, see #94765. So I created a separate ticket for openqaworker11 as well: #94783

#17

Updated by okurz over 3 years ago

Status changed from In Progress to Resolved

We accept the proposal. Both machines have been shown to work for single-machine tests but not yet for multi-machine tests. We have specific tickets for both machines.
