coordination #80142

[saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

Added by okurz over 1 year ago. Updated about 7 hours ago.

Status: Blocked
Priority: High
Assignee:
Category: Feature requests
Target version:
Start date: 2018-06-29
Due date: 2022-10-01
% Done: 66%
Estimated time: (Total: 2.00 h)
Difficulty:
Description

Motivation

Single, non-load-balanced production instances of the webui can cause longer downtimes and make OS upgrades riskier, e.g. when we do not have management access to VMs that might fail to reboot. Load-balancing can also help, as can switch-over deployments, which allow easier testing, staging, etc. Container-based deployment has become an industry standard, which we should fully support and prominently feature, both for simple single-instance setups of individual persons and for multi-node setups in clusters.

Acceptance criteria

  • AC1: an openQA infrastructure deployed on kubernetes is part of our continuous testing setup
  • AC2: documentation exists on how to set up redundant load-balancing infrastructures
  • AC3: The support for openQA on container management frameworks is prominently presented
  • AC4: documentation exists for simple, personal single-instance setups, e.g. "get your openQA tests to run in less than 5 minutes"

Suggestions

Based on the spike conducted in #69355 we can streamline the support, add documentation, introduce proper testing, consider running that setup as part of our DevOps structure, etc.

The state of the art is k8s, so we should aim for that. A "docker compose" file may be a good intermediate step, followed by k8s with a helm chart and potentially also a setup based on gitlab, see
https://docs.gitlab.com/ee/ci/environments/incremental_rollouts.html#blue-green-deployment
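
To illustrate the blue-green switch-over idea referenced in the link above, here is a minimal sketch in Python. It is not part of openQA: the class `BlueGreenRouter`, the backend names and the example URLs are hypothetical, and the health check is passed in as a plain function instead of a real HTTP probe.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Tracks which of two deployments ("blue" or "green") receives traffic."""

    backends: Dict[str, str]  # hypothetical mapping: deployment name -> base URL
    active: str = "blue"

    def standby(self) -> str:
        """Return the name of the deployment that is currently not serving traffic."""
        return "green" if self.active == "blue" else "blue"

    def switch_over(self, is_healthy: Callable[[str], bool]) -> bool:
        """Promote the standby deployment only if its health check passes."""
        candidate = self.standby()
        if is_healthy(self.backends[candidate]):
            self.active = candidate
            return True
        # Standby is unhealthy: keep serving from the current deployment.
        return False


# Placeholder URLs, assumed for illustration only.
router = BlueGreenRouter(
    backends={
        "blue": "http://openqa-blue.example",
        "green": "http://openqa-green.example",
    }
)
```

In such a setup the upgraded instance would be deployed as the standby, checked, and only then promoted with `router.switch_over(check)`, so a failed upgrade leaves the previously active instance in place.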


Subtasks

Tracker #ID: Subject | Status | Assignee
action #41600: fallback mechanism for apache, e.g. on osd | New
coordination #43706: [epic] Generate "download&use" docker image of openQA for SUSE QA | Blocked | okurz
action #43712: Update upstream dockerfiles to provide an easy to use docker image of openQA-webui | Resolved | ilausuch
action #43715: Update upstream dockerfiles to provide an easy to use docker image of workers | Resolved | ilausuch
action #43718: Docker image for webui and workers are versioned and uploaded to obs registry | Resolved | ilausuch
action #80516: Docker image for webui and workers on docker hub reflect current state | Workable
action #80518: provide container images for aarch64 | New
action #80520: Automatic tests for our openQA containers - webUI only | Resolved | ilausuch
action #80534: publication+demo for updated openQA containers | Blocked | okurz
action #80682: Automatic tests for our openQA containers - worker only | Resolved | ilausuch
action #80684: Automatic tests for our openQA containers - worker+webui connection | Workable
action #81118: automatic container tests for os-autoinst | Resolved | okurz
coordination #43934: [epic] Manage o3 infrastructure with salt again | Blocked | okurz
action #90164: Make gitlab.suse.de/openqa/salt-states-openqa public | Resolved | okurz
action #90167: Setup initial salt infrastructure for remote management within o3 | Workable
action #91332: Allow to contribute to salt-states-openqa over github | Workable
action #55262: Install Pgpool-II or PgBouncer before PostgreSQL for openQA instances, e.g. to be used on OSD | New
coordination #55364: [epic] Let's make codecov reports reliable | Resolved | okurz
action #71857: flaky/unstable/sporadic test coverage from t/34-developer_mode-unit.t | Resolved | cdywan
action #78043: unstable/flaky/inconsistent statement coverage in t/lib/OpenQA/SeleniumTest | Resolved | okurz
action #80268: Fix flaky coverage - t/05-scheduler-full.t | Resolved | okurz
action #80274: Fix flaky coverage - t/lib/OpenQA/Test/Utils.pm size:M | Resolved | kraih
action #80298: Fix flaky coverage - lib/OpenQA/WebSockets.pm | Resolved | okurz
action #88594: codecov reports for individual files yield "forbidden" and files can not be shown | Resolved | okurz
action #89899: Fix flaky coverage - t/ui/27-plugin_obs_rsync_status_details.t | Resolved | kraih
action #69355: [spike] redundant/load-balancing webui deployments of openQA | Resolved | ilausuch
openQA Infrastructure - action #70978: automatic reboots on o3 to activate new kernel versions | Resolved | okurz
openQA Infrastructure - action #71098: openqaworker3 down but no alert was raised | Resolved | okurz
openQA Infrastructure - action #73174: [osd][alert] Job age (scheduled) (median) alert | Resolved | okurz
action #73447: POC: Create openQA Web Application container image (feature) | Resolved | ilausuch
action #73450: POC: Create openQA worker container image (feature) | Resolved | ilausuch
openQA Infrastructure - action #76786: Configure static hostnames with salt for all salt nodes | Resolved | okurz
openQA Infrastructure - action #76876: Find a better (automated) way to inform infra about hanging (arm) workers | Resolved | cdywan
action #76990: Improve documentation for redundant/load-balancing webui deployments of openQA | Resolved | mkittler
coordination #77698: [epic] synchronous qemu based system level test in pull request CI runs, e.g. standalone isotovideo or openQA tests | New
action #77905: CI pipeline proof-of-concept running isotovideo | Resolved | okurz
openQA Infrastructure - coordination #78206: [epic] 2020-11-18 nbg power outage aftermath | Resolved | okurz
openQA Infrastructure - action #80540: idea: Conduct "power outage drills", e.g. once every half-year? | Rejected | okurz
openQA Infrastructure - action #80542: Configure "automatic power-on" after power loss for openqaworker1 (and all others) | Resolved | okurz
openQA Infrastructure - action #80544: Ensure that IPMI for powerqaworker-qam works reliably | Resolved | okurz
openQA Infrastructure - action #92185: Our PowerPC worker machines need to be configured for automatic start after a power outage | Resolved | okurz
openQA Infrastructure - action #78218: [openQA][worker] Almost all openQA workers become offline | Resolved | okurz
action #78390: Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs) | Resolved | mkittler
openQA Infrastructure - action #78438: openQA webui entry "Assigned worker" shows ip instead of names as formerly - manual cleanup work | Resolved | okurz
coordination #80150: [epic] Scale out openQA: Easier openQA setup | Blocked | okurz
action #76978: How to run an openQA test in 5 minutes | New
action #80382: Provide installation recipes for automatic installations of openQA worker machines | Workable
action #89920: Extend existing openQA-in-openQA tests as a learning exercise to know where our instructions or beginner situation can be improved | New
action #90038: Better error handling when reading API key+secret from ~/.config/openqa/client.conf | Resolved | mkittler
coordination #90758: [epic] python bindings for openQA | Resolved | okurz
action #90761: os-autoinst-distri-opensuse CI checks fail due to `cpanm --installdeps` failing on Inline::Python | Resolved | osukup
action #91257: try out python backend for production tests in a new test distribution or os-autoinst-distri-openQA | Resolved | cdywan
action #91653: Python tests fail with generic error message regardless of the problem size:M | Resolved | mkittler
coordination #93609: [epic] openqa-bootstrap support on Leap 15.3 | Resolved | mkittler
action #95317: openqa-bootstrap on Leap 15.3 from devel:openQA passes | Resolved | mkittler
action #95320: openqa-bootstrap on Leap 15.3 from official repos passes | Resolved | mkittler
action #95323: openqa-bootstrap support on current version of Leap - automatic test size:S | Resolved | mkittler
action #95185: openqa-bootstrap ignores errors when systemd is not available | Resolved | dheidler
openQA Infrastructure - action #80482: qa-power8-5-kvm has been down for days, use more robust filesystem setup | Workable
coordination #80908: [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs | Resolved | okurz
action #80910: openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs | Resolved | mkittler
action #80986: terminate worker process after executing all currently assigned jobs based on config/env variable | Resolved | mkittler
openQA Infrastructure - action #81884: openqa-webui should automatically restart on config updates | Resolved | okurz
action #89200: Switch OSD deployment to two-daily deployment | Resolved | mkittler
action #90152: module results missing on quick job (on auto-restarting worker) | Resolved | mkittler
action #104178: Increase OSD deployment rate from every second day to daily | Resolved | okurz
action #104841: Prevent empty changelog messages from osd-deployment when there are no changes size:M | Resolved | mkittler
action #105379: Continuous deployment of o3 workers - one worker first size:M | Resolved | mkittler
action #105885: Continuous deployment of o3 workers - all the other o3 workers size:M | Resolved | mkittler
action #111028: Continuous update of o3 webUI | Resolved | okurz
action #111377: Continuous deployment of osd workers - similar as on o3 size:M | Rejected | okurz
coordination #81060: [epic] openQA web UI in kubernetes | Blocked | okurz
action #110524: [timeboxed:20h][spike] openQA proof-of-concept within kubernetes size:M | Resolved | jbaier_cz
action #110725: Unexpected behavior for cache service under k3s when the CACHE_MIN_FREE_PERCENTAGE is set size:M | Workable
action #111323: Simplified web proxy setup (remove path rewrite) | Feedback | jbaier_cz
action #111329: openQA within kubernetes with tested helm charts | New
action #87898: Add grafana alert for "broken workers" as reported by openQA | Resolved | mkittler
action #88187: Set the addresses in the "internal clients" configurable | Resolved | mkittler
coordination #89020: [epic] Support for multiple authentication providers | Workable
action #90929: get OAuth2 to work with salsa.debian.org (gitlab) | Resolved | mkittler
action #89116: Enable openQA cloud IDE workflow out of the box | New
action #89755: container: Fix missing shared directories and its permissions | Resolved
coordination #89842: [epic] Scalable and streamlined docker-compose based openQA setup | New
action #89719: docker-compose up fails on master | Resolved | ilausuch
action #89722: Need automatic check for docker-compose | Resolved | ilausuch
action #89731: containers: The deploy using docker-compose is not stable and eventually fails | Resolved | ilausuch
action #89749: containers: Allow to the user to choose the source of images in the docker-compose | New
action #89752: containers: Add a worker service as part of the docker-compose | Resolved | mkittler
coordination #90359: [epic] Customizable worker engine | Workable
action #90362: Allow to customize worker engine by configuration | Resolved | okurz
coordination #92037: [epic] Extensible openQA: Support for external openQA plugins | Workable
action #92040: Support for external openQA plugins in home directory | New
coordination #92575: [epic] feature switch capabilities | New
action #92578: infrastructure-instance wide feature switches | New
coordination #92854: [epic] limit overload of openQA webUI by heavy requests | Blocked | okurz
action #93925: Optimize SQL on /tests/overview | Resolved | tinita
action #94354: Optimize /dashboard_build_results and /group_overview/* pages | Resolved | tinita
action #94667: Optimize products/machines/test_suites API calls | New
action #94705: Monitor number of SQL queries in grafana | New
action #94753: Try out munin on o3 | Resolved | tinita
action #97190: Limit size of initial requests everywhere, e.g. /, /tests, etc. | Blocked | okurz
action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M | Blocked | cdywan
action #110677: Investigation page shouldn't involve blocking long-running API routes size:M | Resolved | tinita
action #110680: Overview page shouldn't allow long-running requests without limits size:M | In Progress | kraih
action #94255: containers: Improve the speed of the container test in CI | New
coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert | Blocked | okurz
action #70774: save_needle Minion tasks fail frequently | New
coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly" | Blocked | mkittler
action #103416: Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M | Resolved | mkittler
action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M | Resolved | okurz
action #107533: Better handle minion tasks failing with "Job terminated unexpectedly" - "finalize_job_results" size:M | Resolved | mkittler
action #108980: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Asset::Download | New
action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule | New
action #108989: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Needle | New
action #99837: configurable exclusion rules for /influxdb/minion | New
action #100503: Identify all "finalize_job_results" failures and handle them (report ticket or fix) | Resolved | cdywan
openQA Infrastructure - action #97943: Increase number of CPU cores on OSD VM due to high usage size:S | Resolved | mkittler
coordination #98472: [epic] Scale out: Disaster recovery deployments of existing openQA infrastructures | Blocked | okurz
coordination #98952: [epic] t/full-stack.t sporadically fails "clickElement: element not interactable" and other errors | Blocked | mkittler
action #102578: [sporadic] t/full-stack.t Failed test 'Expected result for job 1 not found' size:M | Resolved | okurz
coordination #102581: Proof of concept of t/full-stack.t on GitHubActions | Resolved | cdywan
action #105429: openQA's fullstack test fails in `shutdown` module | Resolved | mkittler
action #106912: Fullstack test can still fail due to `shutdown` module size:M | Resolved | okurz
action #107002: Expose fullstack test video from pool directory in CI size:M | Resolved | cdywan
action #107941: [sporadic] openQA Fullstack test t/full-stack.t can still fail with "udevadm" log message size:M | Workable
coordination #99549: [epic] Split production workload onto multiple hosts (focusing on OSD) | New
coordination #99579: [epic][retro] Follow-up to "Published QCOW images appear to be uncompressed" | Blocked | okurz
action #99654: Revisit decision in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/545 regarding I/O alerts size:S | Resolved | mkittler
coordination #99660: [epic] Use more perl signatures in our perl projects | Blocked | okurz
action #99663: Use more perl signatures - os-autoinst size:M | Feedback | okurz
action #100967: Use more perl signatures - openQA size:M | Workable | okurz
action #105127: Use more perl signatures - openQA - some simple classes size:S | Resolved | kodymo
openQA Infrastructure - coordination #102266: [epic] o3 ran out of disk space | Resolved | okurz
openQA Infrastructure - action #104217: Ask eng infra why thruk.suse.de stopped working | Resolved | okurz
coordination #102951: [epic] Better network performance monitoring | Resolved | okurz
action #102957: Better network performance monitoring - up-/download speed from cache service, e.g. in log file size:M | Resolved | kraih
action #106901: Expose bandwidth data for worker cache via influxdb size:M | Resolved | kraih
action #106904: Monitoring for worker specific bandwidth size:M | Resolved | kraih
coordination #103122: [epic] Test distribution template for openQA tests to use, e.g. for automatic PR tests in os-autoinst-distri-openQA | Blocked | okurz
action #103125: compile tests in os-autoinst-distri-openQA defined in template project size:S | Workable
coordination #106922: [epic][sporadic] openqa_from_git fails in dashboard due to ensure_unlocked_desktop not expecting password entry screen in case of locked desktop auto_review:"match=desktop-runner,screenlock timed out.*":retry | Blocked | okurz
action #107701: [osd] Job detail page fails to load | Resolved | tinita
coordination #108527: [epic] os-autoinst plugins (or wheels or leaves or scrolls) for scalable code reuse of helper functions and segmented test distributions | Blocked | okurz
action #81899: [easy][beginner] Move code from isotovideo to a module size:M | Workable | okurz
action #108530: os-autoinst plugins: x11_start_program from os-autoinst-distri-openQA dynamically loaded from another git repo size:M | Blocked | cdywan
openQA Infrastructure - action #108740: qa-power8-5-kvm minions alert is heart-broken 💔️ | Rejected | okurz
openQA Infrastructure - action #108743: qa-power8-5-kvm minions alert is heart-broken | Resolved | nicksinger
coordination #109659: [epic] More remote workers | Blocked | okurz
openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:M | Resolved | mkittler
action #107497: [qe-tools] openqaworker14 (and openqa15) - developer mode fails size:M | Workable
openQA Infrastructure - action #110467: Establish reliable tap setup on ow14 | New
action #110515: Command export feature for openqa-clone-job size:M | Resolved | mkittler
openQA Infrastructure - action #97862: More openQA worker hardware for OSD size:M | Blocked | okurz
openQA Infrastructure - action #94765: Bring openqaworker12 into production (w/o multi-machine test support) size:M | Workable
openQA Infrastructure - action #94783: Bring openqaworker11 into production including multi-machine test support (same as w12) | New
openQA Infrastructure - action #111473: Get replacement for existing machines, e.g. imagetester, openqaworker1 (potentially more) | New
coordination #101048: [epic] Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 | Blocked | mkittler
action #101262: Document running os-autoinst full-stack.t on OSD workers size:M | Resolved | okurz
action #101265: Upgrade arm3 to Leap 15.3 and compare failure rate size:M | Resolved | mkittler
openQA Infrastructure - action #101271: Try Kernel:stable on arm4+arm5 and compare failure rate size:M | Resolved | kraih
openQA Infrastructure - action #104304: Crosscheck results of https://github.com/os-autoinst/os-autoinst#verifying-a-runtime-environment on arm-1/2/3 vs. arm-4/5 to find out if arm-4/5 are "typing stable" size:M | Resolved | mkittler
action #109232: Document relevant differences of arm-4/5 vs. arm-1/2/3 and aarch64.o.o, involve domain experts in asking what parameters are important to be able to run openQA tests size:M | Resolved | mkittler
openQA Infrastructure - action #109494: Restore network connection of arm-4/5 size:M | Resolved | nicksinger
openQA Infrastructure - action #110539: Ask OBS team if they would like to swap ARM workers with us | New
action #110542: Try to mitigate "VNC typing issues" with disabled key repeat | New
openQA Infrastructure - action #110545: Investigate and fix higher instability of openqaworker-arm-4/5 vs. arm-1/2/3 - further things to try size:M | Workable | okurz
openQA Infrastructure - action #105594: Two new machines for OSD and o3, meant for bare-metal virtualization size:M | Feedback | nicksinger
openQA Infrastructure - action #106832: Monitor masked units on our infrastructure | Resolved | okurz
openQA Infrastructure - action #109635: Check grafana monitoring host performance size:M | Workable
openQA Infrastructure - action #109746: Improve QA related server room management, consistent naming and tagging size:M | Workable
openQA Infrastructure - action #110296: Machines within .qa.suse.de unavailable (was: Some ipmi workers become offline which affects PublicRC-202204 candidate Build137.1 test run) | Resolved | nicksinger
openQA Infrastructure - action #110521: Improve QA related server room management, network topology and configuration size:M | Workable
openQA Infrastructure - action #111149: Recover openqaworker-arm-3 | Resolved | okurz
coordination #109968: [epic] fully containerized openQA on SLE microOS | New


Related issues

Related to openQA Project - action #80466: docker: Base the webUI and worker Dockerfiles in Tumbleweed | Workable | 2020-11-26

Related to openQA Project - action #92893: containers, docker-compose: Ensure that the scheduler can connect to the websockets container size:M | Resolved | 2021-05-20

Related to openQA Project - action #110497: Minion influxdb data causing unusual download rates size:M | Resolved | 2022-05-01

History

#1 Updated by okurz over 1 year ago

  • Subject changed from [saga][epic] redundant/load-balancing webui deployments of openQA to [saga][epic] redundant/load-balancing webui deployments of openQA (container on kubernetes)
  • Status changed from Workable to Blocked
  • Assignee set to okurz

I created this saga as apparently the whole "container" feature block was not apparent enough for some stakeholders.

Some subtasks are there, ready to be worked on. Setting to "Blocked" by subtasks.

#2 Updated by okurz over 1 year ago

  • Subject changed from [saga][epic] redundant/load-balancing webui deployments of openQA (container on kubernetes) to [saga][epic] Scale out openQA: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
  • Description updated (diff)

#3 Updated by okurz over 1 year ago

  • Tracker changed from action to coordination

#4 Updated by ilausuch over 1 year ago

  • Related to action #80466: docker: Base the webUI and worker Dockerfiles in Tumbleweed added

#5 Updated by cdywan over 1 year ago

  • AC4: documentation exists for simple single-instance setups, e.g. "get your openQA tests to run in less than 5 minutes"

This seems oddly phrased considering how the Motivation states:

Single instances of the webui can cause longer downtimes and make upgrades of OS more risky

I would expect that as a side-effect of using kubernetes users decide what resources they want to use.

#6 Updated by okurz over 1 year ago

  • Description updated (diff)

cdywan wrote:

  • AC4: documentation exists for simple single-instance setups, e.g. "get your openQA tests to run in less than 5 minutes"

This seems oddly phrased considering how the Motivation states:

Single instances of the webui can cause longer downtimes and make upgrades of OS more risky

I would expect that as a side-effect of using kubernetes users decide what resources they want to use.

Right. I rephrased it so that it's clear that in the motivation we mean "single, not-load-balanced production instances" in contrast to "personal instances that are fast to set up".

#7 Updated by cdywan about 1 year ago

  • Related to action #89749: containers: Allow to the user to choose the source of images in the docker-compose added

#8 Updated by okurz about 1 year ago

  • Related to deleted (action #89749: containers: Allow to the user to choose the source of images in the docker-compose)

#9 Updated by okurz about 1 year ago

  • Subject changed from [saga][epic] Scale out openQA: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes to [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

#10 Updated by mkittler 7 months ago

  • Related to action #92893: containers, docker-compose: Ensure that the scheduler can connect to the websockets container size:M added

#11 Updated by okurz 20 days ago

  • Related to action #110497: Minion influxdb data causing unusual download rates size:M added
