coordination #80142

[saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

Added by okurz 11 months ago. Updated 3 days ago.

Status: Blocked
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2018-09-26
Due date: 2021-11-05
% Done: 67%
Estimated time: (Total: 2.00 h)
Difficulty:

Description

Motivation

Single production instances of the webui can cause longer downtimes and make OS upgrades riskier, e.g. when we do not have management access to VMs that might fail to reboot. Load-balancing can help here, as can switch-over deployments, which make testing, staging, etc. easier. Container-based deployment has become an industry standard, so we should fully support it and feature it prominently, both for simple single-instance setups of individual users and for multi-node setups in clusters.

Acceptance criteria

  • AC1: an openQA infrastructure deployed on kubernetes is part of our continuous testing setup
  • AC2: documentation exists on how to set up redundant load-balancing infrastructures
  • AC3: The support for openQA on container management frameworks is prominently presented
  • AC4: documentation exists for simple, personal single-instance setups, e.g. "get your openQA tests to run in less than 5 minutes"

Suggestions

Based on the spike conducted in #69355, we can streamline the support, add documentation, introduce proper testing, consider running that setup as part of our DevOps structure, etc.

The state of the art is k8s, so we should aim for that. A "docker compose" file may be a good intermediate step, followed by k8s with a helm chart and potentially a setup based on GitLab, see
https://docs.gitlab.com/ee/ci/environments/incremental_rollouts.html#blue-green-deployment
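
To illustrate the blue-green idea behind the link above, here is a minimal sketch in Python. It is not taken from openQA or GitLab; the class name `BlueGreenRouter`, the slot names and the example URLs are all hypothetical. It only shows the core rule of a switch-over deployment: traffic moves to the idle deployment only after that deployment passes a health probe.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Tracks which of two deployments ("blue" or "green") receives traffic."""

    backends: Dict[str, str]           # slot name -> base URL (hypothetical values)
    is_healthy: Callable[[str], bool]  # health probe, injected so it can be stubbed
    active: str = "blue"

    def idle(self) -> str:
        """Return the name of the slot that is currently not serving traffic."""
        return "green" if self.active == "blue" else "blue"

    def switch(self) -> bool:
        """Route traffic to the idle slot, but only if it passes the health probe."""
        candidate = self.idle()
        if not self.is_healthy(self.backends[candidate]):
            return False  # keep serving from the current slot
        self.active = candidate
        return True

    def target(self) -> str:
        """Return the base URL that requests should currently be sent to."""
        return self.backends[self.active]
```

In a real setup the probe would be an HTTP check against the freshly upgraded instance and `target()` would feed a reverse proxy or k8s service selector; both are left out here on purpose.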


Subtasks

action #41600: fallback mechanism for apache, e.g. on osd | New

coordination #43706: [epic] Generate "download&use" docker image of openQA for SUSE QA | Blocked | okurz

action #43712: Update upstream dockerfiles to provide an easy to use docker image of openQA-webui | Resolved | ilausuch

action #43715: Update upstream dockerfiles to provide an easy to use docker image of workers | Resolved | ilausuch

action #43718: Docker image for webui and workers are versioned and uploaded to obs registry | Resolved | ilausuch

action #80516: Docker image for webui and workers on docker hub reflect current state | Workable

action #80518: provide container images for aarch64 | New

action #80520: Automatic tests for our openQA containers - webUI only | Resolved | ilausuch

action #80534: publication+demo for updated openQA containers | Blocked | okurz

action #80682: Automatic tests for our openQA containers - worker only | Resolved | ilausuch

action #80684: Automatic tests for our openQA containers - worker+webui connection | Workable

action #81118: automatic container tests for os-autoinst | Resolved | okurz

coordination #43934: [epic] Manage o3 infrastructure with salt again | Blocked | okurz

action #90164: Make gitlab.suse.de/openqa/salt-states-openqa public | Resolved | okurz

action #90167: Setup initial salt infrastructure for remote management within o3 | Workable

action #91332: Allow to contribute to salt-states-openqa over github | Workable

action #55262: Install Pgpool-II or PgBouncer before PostgreSQL for openQA instances, e.g. to be used on OSD | New

coordination #55364: [epic] Let's make codecov reports reliable | Resolved | okurz

action #71857: flaky/unstable/sporadic test coverage from t/34-developer_mode-unit.t | Resolved | cdywan

action #78043: unstable/flaky/inconsistent statement coverage in t/lib/OpenQA/SeleniumTest | Resolved | okurz

action #80268: Fix flaky coverage - t/05-scheduler-full.t | Resolved | okurz

action #80274: Fix flaky coverage - t/lib/OpenQA/Test/Utils.pm size:M | Resolved | kraih

action #80298: Fix flaky coverage - lib/OpenQA/WebSockets.pm | Resolved | okurz

action #88594: codecov reports for individual files yield "forbidden" and files can not be shown | Resolved | okurz

action #89899: Fix flaky coverage - t/ui/27-plugin_obs_rsync_status_details.t | Resolved | kraih

action #69355: [spike] redundant/load-balancing webui deployments of openQA | Resolved | ilausuch

openQA Infrastructure - action #70978: automatic reboots on o3 to activate new kernel versions | Resolved | okurz

openQA Infrastructure - action #71098: openqaworker3 down but no alert was raised | Resolved | okurz

openQA Infrastructure - action #73174: [osd][alert] Job age (scheduled) (median) alert | Resolved | okurz

action #73447: POC: Create openQA Web Application container image (feature) | Resolved | ilausuch

action #73450: POC: Create openQA worker container image (feature) | Resolved | ilausuch

openQA Infrastructure - action #76786: Configure static hostnames with salt for all salt nodes | Resolved | okurz

openQA Infrastructure - action #76876: Find a better (automated) way to inform infra about hanging (arm) workers | Resolved | cdywan

action #76990: Improve documentation for redundant/load-balancing webui deployments of openQA | Resolved | mkittler

coordination #77698: [epic] synchronous qemu based system level test in pull request CI runs, e.g. standalone isotovideo or openQA tests | New

action #77905: CI pipeline proof-of-concept running isotovideo | Resolved | okurz

openQA Infrastructure - coordination #78206: [epic] 2020-11-18 nbg power outage aftermath | Resolved | okurz

openQA Infrastructure - action #80540: idea: Conduct "power outage drills", e.g. once every half-year? | Rejected | okurz

openQA Infrastructure - action #80542: Configure "automatic power-on" after power loss for openqaworker1 (and all others) | Resolved | okurz

openQA Infrastructure - action #80544: Ensure that IPMI for powerqaworker-qam works reliably | Resolved | okurz

openQA Infrastructure - action #92185: Our PowerPC worker machines need to be configured for automatic start after a power outage | Resolved | okurz

openQA Infrastructure - action #78218: [openQA][worker] Almost all openQA workers become offline | Resolved | okurz

action #78390: Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs) | Resolved | mkittler

openQA Infrastructure - action #78438: openQA webui entry "Assigned worker" shows ip instead of names as formerly - manual cleanup work | Resolved | okurz

coordination #80150: [epic] Scale out openQA: Easier openQA setup | Blocked | okurz

action #76978: How to run an openQA test in 5 minutes | New

action #80382: Provide installation recipes for automatic installations of openQA worker machines | Workable

action #89920: Extend existing openQA-in-openQA tests as a learning exercise to know where our instructions or beginner situation can be improved | New

action #90038: Better error handling when reading API key+secret from ~/.config/openqa/client.conf | Resolved | mkittler

coordination #90758: [epic] python bindings for openQA | Resolved | okurz

action #90761: os-autoinst-distri-opensuse CI checks fail due to `cpanm --installdeps` failing on Inline::Python | Resolved | osukup

action #91257: try out python backend for production tests in a new test distribution or os-autoinst-distri-openQA | Resolved | cdywan

action #91653: Python tests fail with generic error message regardless of the problem size:M | Resolved | mkittler

coordination #93609: [epic] openqa-bootstrap support on Leap 15.3 | Blocked | mkittler

action #95317: openqa-bootstrap on Leap 15.3 from devel:openQA passes | Resolved | mkittler

action #95320: openqa-bootstrap on Leap 15.3 from official repos passes | Resolved | mkittler

action #95323: openqa-bootstrap support on Leap 15.3 - automatic test | New

action #95185: openqa-bootstrap ignores errors when systemd is not available | Resolved | dheidler

openQA Infrastructure - action #80482: qa-power8-5-kvm has been down for days, use more robust filesystem setup | Workable

action #80908: [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs | New

action #80910: openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs | Resolved | mkittler

action #80986: terminate worker process after executing all currently assigned jobs based on config/env variable | Resolved | mkittler

openQA Infrastructure - action #81884: openqa-webui should automatically restart on config updates | Resolved | okurz

action #89200: Switch OSD deployment to two-daily deployment | Resolved | mkittler

action #90152: module results missing on quick job (on auto-restarting worker) | Resolved | mkittler

action #81060: Create a helm chart to deploy web UI in kubernetes | Blocked | okurz

action #87898: Add grafana alert for "broken workers" as reported by openQA | Resolved | mkittler

action #88187: Set the addresses in the "internal clients" configurable | Resolved | mkittler

coordination #89020: [epic] Support for multiple authentication providers | Workable

action #90929: get OAuth2 to work with salsa.debian.org (gitlab) | Resolved | mkittler

action #89116: Enable openQA cloud IDE workflow out of the box | New

action #89755: container: Fix missing shared directories and its permissions | Resolved

coordination #89842: [epic] Scalable and streamlined docker-compose based openQA setup | New

action #89719: docker-compose up fails on master | Resolved | ilausuch

action #89722: Need automatic check for docker-compose | Resolved | ilausuch

action #89731: containers: The deploy using docker-compose is not stable and eventually fails | Resolved | ilausuch

action #89749: containers: Allow to the user to choose the source of images in the docker-compose | New

action #89752: containers: Add a worker service as part of the docker-compose | Resolved | mkittler

coordination #90359: [epic] Customizable worker engine | Workable

action #90362: Allow to customize worker engine by configuration | Resolved | okurz

coordination #92037: [epic] Extensible openQA: Support for external openQA plugins | Workable

action #92040: Support for external openQA plugins in home directory | New

coordination #92575: [epic] feature switch capabilities | New

action #92578: infrastructure-instance wide feature switches | New

coordination #92854: [epic] limit overload of openQA webUI by heavy requests | New

action #93925: Optimize SQL on /tests/overview | Resolved | tinita

action #94354: Optimize /dashboard_build_results and /group_overview/* pages | Resolved | tinita

action #94667: Optimize products/machines/test_suites API calls | New

action #94705: Monitor number of SQL queries in grafana | New

action #94753: Try out munin on o3 | Resolved | tinita

action #97190: Limit size of initial requests everywhere, e.g. /, /tests, etc. | New

action #94255: containers: Improve the speed of the container test in CI | New

openQA Infrastructure - action #97943: Increase number of CPU cores on OSD VM due to high usage size:S | Resolved | mkittler

coordination #98472: [epic] Scale out: Disaster recovery deployments of existing openQA infrastructures | Blocked | okurz

coordination #99549: [epic] Split production workload onto multiple hosts (focusing on OSD) | New


Related issues

Related to openQA Project - action #80466: docker: Base the webUI and worker Dockerfiles in Tumbleweed | Workable | 2020-11-26

Related to openQA Project - action #92893: containers, docker-compose: Ensure that the scheduler can connect to the websockets container size:M | Resolved | 2021-05-20

History

#1 Updated by okurz 11 months ago

  • Subject changed from [saga][epic] redundant/load-balancing webui deployments of openQA to [saga][epic] redundant/load-balancing webui deployments of openQA (container on kubernetes)
  • Status changed from Workable to Blocked
  • Assignee set to okurz

I created this saga because the whole "container" feature block was apparently not visible enough for some stakeholders.

Some subtasks are there, ready to be worked on. Setting to "Blocked" by subtasks.

#2 Updated by okurz 11 months ago

  • Subject changed from [saga][epic] redundant/load-balancing webui deployments of openQA (container on kubernetes) to [saga][epic] Scale out openQA: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
  • Description updated (diff)

#3 Updated by okurz 11 months ago

  • Tracker changed from action to coordination

#4 Updated by ilausuch 11 months ago

  • Related to action #80466: docker: Base the webUI and worker Dockerfiles in Tumbleweed added

#5 Updated by cdywan 11 months ago

  • AC4: documentation exists for simple single-instance setups, e.g. "get your openQA tests to run in less than 5 minutes"

This seems oddly phrased considering how the Motivation states:

Single instances of the webui can cause longer downtimes and make upgrades of OS more risky

I would expect that, as a side-effect of using kubernetes, users decide what resources they want to use.

#6 Updated by okurz 8 months ago

  • Description updated (diff)

cdywan wrote:

  • AC4: documentation exists for simple single-instance setups, e.g. "get your openQA tests to run in less than 5 minutes"

This seems oddly phrased considering how the Motivation states:

Single instances of the webui can cause longer downtimes and make upgrades of OS more risky

I would expect that, as a side-effect of using kubernetes, users decide what resources they want to use.

Right. I rephrased it so that the motivation makes clear that we mean "single, not-load-balanced production instances" in contrast to "personal instances that are fast to set up".

#7 Updated by cdywan 8 months ago

  • Related to action #89749: containers: Allow to the user to choose the source of images in the docker-compose added

#8 Updated by okurz 8 months ago

  • Related to deleted (action #89749: containers: Allow to the user to choose the source of images in the docker-compose)

#9 Updated by okurz 6 months ago

  • Subject changed from [saga][epic] Scale out openQA: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes to [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

#10 Updated by mkittler 8 days ago

  • Related to action #92893: containers, docker-compose: Ensure that the scheduler can connect to the websockets container size:M added
