coordination #80142

[saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

Added by okurz over 2 years ago. Updated 18 days ago.

Status: Blocked
Priority: Normal
Assignee:
Category: Feature requests
Target version:
Start date: 2018-09-26
Due date:
% Done: 73%
Estimated time: (Total: 2.00 h)
Difficulty:

Description

Motivation

Container-based deployment has become an industry standard. We should fully support it and prominently feature it as supported, both for simple single-instance setups of individual users and for multi-node setups in clusters.

Also, single production instances of the webUI can cause longer downtimes and make OS upgrades riskier, e.g. when we do not have management access to VMs that might fail to reboot. Load-balancing can help here, as can switch-over deployments, which would also make testing, staging, etc. easier.

Acceptance criteria

  • AC1: an openQA infrastructure deployed on kubernetes is part of our continuous testing setup
  • AC2: documentation exists on how to set up redundant load-balancing infrastructures
  • AC3: The support for openQA on container management frameworks is prominently presented
  • AC4: documentation exists for simple, personal single-instance setups, e.g. "get your openQA tests to run in less than 5 minutes"

Suggestions

Based on the spike conducted in #69355 we can streamline the support, add documentation, introduce proper testing, consider running that setup as part of our DevOps structure, etc.

The state of the art is k8s, so we should aim for that. A "docker compose" file may be a good intermediate step, followed by k8s with a helm chart, and potentially also some setup based on gitlab, see
https://docs.gitlab.com/ee/ci/environments/incremental_rollouts.html#blue-green-deployment
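
To illustrate the switch-over idea behind a blue-green deployment, here is a minimal sketch in Python. It is not part of openQA or of any existing setup: the names `Deployment`, `BlueGreenRouter`, `upgrade_idle` and `switch_over` are hypothetical and only model the concept of upgrading an idle deployment and then redirecting traffic to it if it is healthy.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """One complete web UI deployment (hypothetical model, not openQA code)."""
    name: str
    version: str
    healthy: bool = True


@dataclass
class BlueGreenRouter:
    """Sends all traffic to exactly one of two deployments."""
    blue: Deployment
    green: Deployment
    live: str = "blue"

    def active(self) -> Deployment:
        # The deployment currently serving users.
        return self.blue if self.live == "blue" else self.green

    def idle(self) -> Deployment:
        # The deployment that can be upgraded and tested without user impact.
        return self.green if self.live == "blue" else self.blue

    def upgrade_idle(self, version: str, healthy: bool = True) -> None:
        # Upgrade only the idle side; the active side keeps serving requests.
        idle = self.idle()
        idle.version = version
        idle.healthy = healthy

    def switch_over(self) -> bool:
        # Redirect traffic only if the idle side passed its health check.
        if not self.idle().healthy:
            return False
        self.live = "green" if self.live == "blue" else "blue"
        return True


router = BlueGreenRouter(Deployment("blue", "1.0"), Deployment("green", "1.0"))
router.upgrade_idle("1.1")
if router.switch_over():
    print("now serving from", router.active().name, router.active().version)
```

In a real setup the "router" would be a load balancer or ingress and the health check an actual request against the upgraded instance. The point of the sketch is only that a failed upgrade leaves the previously active deployment untouched.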


Subtasks

action #41600: fallback mechanism for apache, e.g. on osd New

coordination #43706: [epic] Generate "download&use" docker image of openQA for SUSE QA Blocked okurz

action #43712: Update upstream dockerfiles to provide an easy to use docker image of openQA-webui Resolved ilausuch

action #43715: Update upstream dockerfiles to provide an easy to use docker image of workers Resolved ilausuch

action #43718: Docker image for webui and workers are versioned and uploaded to obs registry Resolved ilausuch

action #80516: Docker image for webui and workers on docker hub reflect current state Workable

action #80518: provide container images for aarch64 New

action #80520: Automatic tests for our openQA containers - webUI only Resolved ilausuch

action #80534: publication+demo for updated openQA containers Blocked okurz

action #80682: Automatic tests for our openQA containers - worker only Resolved ilausuch

action #80684: Automatic tests for our openQA containers - worker+webui connection Workable

action #81118: automatic container tests for os-autoinst Resolved okurz

coordination #43934: [epic] Manage o3 infrastructure with salt again Blocked okurz

action #90164: Make gitlab.suse.de/openqa/salt-states-openqa public Resolved okurz

action #90167: Setup initial salt infrastructure for remote management within o3 New

action #91332: Allow to contribute to salt-states-openqa over github Workable

action #55262: Install Pgpool-II or PgBouncer before PostgreSQL for openQA instances, e.g. to be used on OSD New

coordination #55364: [epic] Let's make codecov reports reliable Resolved okurz

action #71857: flaky/unstable/sporadic test coverage from t/34-developer_mode-unit.t Resolved cdywan

action #78043: unstable/flaky/inconsistent statement coverage in t/lib/OpenQA/SeleniumTest Resolved okurz

action #80268: Fix flaky coverage - t/05-scheduler-full.t Resolved okurz

action #80274: Fix flaky coverage - t/lib/OpenQA/Test/Utils.pm size:M Resolved kraih

action #80298: Fix flaky coverage - lib/OpenQA/WebSockets.pm Resolved okurz

action #88594: codecov reports for individual files yield "forbidden" and files can not be shown Resolved okurz

action #89899: Fix flaky coverage - t/ui/27-plugin_obs_rsync_status_details.t Resolved kraih

action #69355: [spike] redundant/load-balancing webui deployments of openQA Resolved ilausuch

openQA Infrastructure - action #70978: automatic reboots on o3 to activate new kernel versions Resolved okurz

openQA Infrastructure - action #71098: openqaworker3 down but no alert was raised Resolved okurz

openQA Infrastructure - action #73174: [osd][alert] Job age (scheduled) (median) alert Resolved okurz

action #73447: POC: Create openQA Web Application container image (feature) Resolved ilausuch

action #73450: POC: Create openQA worker container image (feature) Resolved ilausuch

openQA Infrastructure - action #76786: Configure static hostnames with salt for all salt nodes Resolved okurz

openQA Infrastructure - action #76876: Find a better (automated) way to inform infra about hanging (arm) workers Resolved cdywan

action #76990: Improve documentation for redundant/load-balancing webui deployments of openQA Resolved mkittler

openQA Infrastructure - coordination #78206: [epic] 2020-11-18 nbg power outage aftermath Resolved okurz

openQA Infrastructure - action #80540: idea: Conduct "power outage drills", e.g. once every half-year? Rejected okurz

openQA Infrastructure - action #80542: Configure "automatic power-on" after power loss for openqaworker1 (and all others) Resolved okurz

openQA Infrastructure - action #80544: Ensure that IPMI for powerqaworker-qam works reliably Resolved okurz

openQA Infrastructure - action #92185: Our PowerPC worker machines need to be configured for automatic start after a power outage Resolved okurz

openQA Infrastructure - action #78218: [openQA][worker] Almost all openQA workers become offline Resolved okurz

action #78390: Worker is stuck in "broken" state due to unavailable cache service (was: and even continuously fails to (re)connect to some configured web UIs) Resolved mkittler

openQA Infrastructure - action #78438: openQA webui entry "Assigned worker" shows ip instead of names as formerly - manual cleanup work Resolved okurz

openQA Infrastructure - action #80482: qa-power8-5-kvm has been down for days, use more robust filesystem setup Resolved okurz

coordination #80908: [epic] Continuous deployment (package upgrade or config update) without interrupting currently running openQA jobs Resolved okurz

action #80910: openQA workers read updated configuration, e.g. WORKER_CLASS, whenever they are ready to pick up new jobs Resolved mkittler

action #80986: terminate worker process after executing all currently assigned jobs based on config/env variable Resolved mkittler

openQA Infrastructure - action #81884: openqa-webui should automatically restart on config updates Resolved okurz

action #89200: Switch OSD deployment to two-daily deployment Resolved mkittler

action #90152: module results missing on quick job (on auto-restarting worker) Resolved mkittler

action #104178: Increase OSD deployment rate from every second day to daily Resolved okurz

action #104841: Prevent empty changelog messages from osd-deployment when there are no changes size:M Resolved mkittler

action #105379: Continuous deployment of o3 workers - one worker first size:M Resolved mkittler

action #105885: Continuous deployment of o3 workers - all the other o3 workers size:M Resolved mkittler

action #111028: Continuous update of o3 webUI Resolved okurz

action #111377: Continuous deployment of osd workers - similar as on o3 size:M Rejected okurz

coordination #81060: [epic] openQA web UI in kubernetes Blocked okurz

action #110524: [timeboxed:20h][spike] openQA proof-of-concept within kubernetes size:M Resolved jbaier_cz

action #110725: Unexpected behavior for cache service under k3s when the CACHE_MIN_FREE_PERCENTAGE is set size:M Workable

action #111323: Simplified web proxy setup (remove path rewrite) Resolved jbaier_cz

action #111329: openQA within kubernetes with tested helm charts size:M Resolved jbaier_cz

action #113402: Handle error messages in existing helm chart CI test logs New

action #87898: Add grafana alert for "broken workers" as reported by openQA Resolved mkittler

action #88187: Set the addresses in the "internal clients" configurable Resolved mkittler

coordination #89020: [epic] Support for multiple authentication providers Workable

action #90929: get OAuth2 to work with salsa.debian.org (gitlab) Resolved mkittler

action #89116: Enable openQA cloud IDE workflow out of the box New

action #89755: container: Fix missing shared directories and its permissions Resolved

coordination #89842: [epic] Scalable and streamlined docker-compose based openQA setup New

action #89719: docker-compose up fails on master Resolved ilausuch

action #89722: Need automatic check for docker-compose Resolved ilausuch

action #89731: containers: The deploy using docker-compose is not stable and eventually fails Resolved ilausuch

action #89749: containers: Allow to the user to choose the source of images in the docker-compose New

action #89752: containers: Add a worker service as part of the docker-compose Resolved mkittler

coordination #90359: [epic] Customizable worker engine Workable

action #90362: Allow to customize worker engine by configuration Resolved okurz

coordination #92037: [epic] Extensible openQA: Support for external openQA plugins Workable

action #92040: Support for external openQA plugins in home directory New

coordination #92575: [epic] feature switch capabilities New

action #92578: infrastructure-instance wide feature switches New

coordination #92854: [epic] limit overload of openQA webUI by heavy requests Blocked okurz

action #93925: Optimize SQL on /tests/overview Resolved tinita

action #94354: Optimize /dashboard_build_results and /group_overview/* pages Resolved tinita

action #94667: Optimize products/machines/test_suites API calls size:M Resolved kraih

action #94705: Monitor number of SQL queries in grafana New

action #94753: Try out munin on o3 Resolved tinita

action #97190: Limit size of initial requests everywhere, e.g. /, /tests, etc., over webUI and API Blocked okurz

action #111770: Limit finished tests on /tests, but query configurable and show complete number of jobs size:S Resolved mkittler

action #113189: Research where we need limits size:S Resolved mkittler

action #114421: Add a limit where it makes sense after we have it for /tests, query configurable size:M Resolved robert.richardson

action #119428: Ensure users can get all the data for limited queries, e.g. with pagination Blocked kraih

action #120841: Add pagination for GET /api/v1/assets size:M Resolved kraih

action #121048: Add pagination for GET /api/v1/bugs Resolved kraih

action #121099: Add pagination for GET /api/v1/jobs/<job_id:num>/comments New

action #121102: Add pagination for GET /api/v1/jobs size:M Resolved kraih

action #121105: Add pagination for GET /api/v1/test_suites, GET /api/v1/test_suites/:id, GET /api/v1/machines, GET api/v1/machines/:id, GET /api/v1/products, GET /api/v1/products:id size:M Resolved kraih

action #121108: Add pagination for GET /api/v1/workers Resolved kraih

action #121111: Add pagination for GET /api/v1/job_templates New

action #121114: Add pagination for GET /api/v1/job_settings/jobs New

action #121117: Add pagination for GET /experimental/search New

action #119431: Inform users e.g. in the webUI if not all results are returned size:M Blocked kraih

action #124230: "Off-by-one" error when using the limit=1 openQA API parameter Resolved mkittler

action #106759: Worker xyz has no heartbeat (400 seconds), restarting repeatedly reported on o3 size:M Resolved cdywan

action #110677: Investigation page shouldn't involve blocking long-running API routes size:M Resolved tinita

action #110680: Overview page shouldn't allow long-running requests without limits size:M Resolved kraih

action #126935: Promote /experimental/jobs/.../status as non-experimental API New

openQA Infrastructure - action #126941: [spike] Evaluate a move of the osd database to its own VM New

action #94255: containers: Improve the speed of the container test in CI New

coordination #96263: [epic] Exclude certain Minion tasks from "Too many Minion job failures alert" alert New

action #70774: save_needle Minion tasks fail frequently New

coordination #99831: [epic] Better handle minion tasks failing with "Job terminated unexpectedly" New

action #103416: Better handle minion tasks failing with "Job terminated unexpectedly" - "limit_results_and_logs" size:M Resolved mkittler

action #104116: Better handle minion tasks failing with "Job terminated unexpectedly" - "scan_needles" size:M Resolved okurz

action #107533: Better handle minion tasks failing with "Job terminated unexpectedly" - "finalize_job_results" size:M Resolved mkittler

action #108980: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Asset::Download New

action #108983: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Iso::Schedule New

action #108989: Better handle minion tasks failing with "Job terminated unexpectedly" - OpenQA::Task::Needle New

action #99837: configurable exclusion rules for /influxdb/minion New

action #100503: Identify all "finalize_job_results" failures and handle them (report ticket or fix) Resolved cdywan

openQA Infrastructure - action #97943: Increase number of CPU cores on OSD VM due to high usage size:S Resolved mkittler

coordination #98472: [epic] Scale out: Disaster recovery deployments of existing openQA infrastructures Resolved okurz

coordination #98952: [epic] t/full-stack.t sporadically fails "clickElement: element not interactable" and other errors Resolved mkittler

action #102578: [sporadic] t/full-stack.t Failed test 'Expected result for job 1 not found' size:M Resolved okurz

coordination #102581: Proof of concept of t/full-stack.t on GitHubActions Resolved cdywan

action #105429: openQA's fullstack test fails in `shutdown` module Resolved mkittler

action #106912: Fullstack test can still fail due to `shutdown` module size:M Resolved okurz

action #107002: Expose fullstack test video from pool directory in CI size:M Resolved cdywan

action #107941: [sporadic] openQA Fullstack test t/full-stack.t can still fail with "udevadm" log message size:M Resolved kraih

action #120226: Ensure openQA t/full-stack.t is stable again and not tracked as unstable test size:S Resolved kraih

coordination #99549: [epic] Split production workload onto multiple hosts (focusing on OSD) New

coordination #99579: [epic][retro] Follow-up to "Published QCOW images appear to be uncompressed" Resolved okurz

action #99654: Revisit decision in https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/545 regarding I/O alerts size:S Resolved mkittler

coordination #99660: [epic] Use more perl signatures in our perl projects Resolved okurz

action #99663: Use more perl signatures - os-autoinst size:M Resolved okurz

action #105127: Use more perl signatures - openQA - some simple classes size:S Resolved kodymo

openQA Infrastructure - coordination #102266: [epic] o3 ran out of disk space Resolved okurz

openQA Infrastructure - action #104217: Ask eng infra why thruk.suse.de stopped working Resolved okurz

coordination #102951: [epic] Better network performance monitoring Resolved okurz

action #102957: Better network performance monitoring - up-/download speed from cache service, e.g. in log file size:M Resolved kraih

action #106901: Expose bandwidth data for worker cache via influxdb size:M Resolved kraih

action #106904: Monitoring for worker specific bandwidth size:M Resolved kraih

coordination #103122: [epic] Test distribution template for openQA tests to use, e.g. for automatic PR tests in os-autoinst-distri-openQA Blocked okurz

action #103125: compile tests in os-autoinst-distri-openQA defined in template project size:S Workable

action #107701: [osd] Job detail page fails to load Resolved tinita

openQA Infrastructure - action #108740: qa-power8-5-kvm minions alert is heart-broken 💔️ Rejected okurz

openQA Infrastructure - action #108743: qa-power8-5-kvm minions alert is heart-broken Resolved nicksinger

coordination #109659: [epic] More remote workers Blocked okurz

openQA Infrastructure - action #104970: Add two OSD workers (openqaworker14+openqaworker15) specifically for sap-application testing size:M Resolved mkittler

action #107497: [qe-tools] openqaworker14 (and openqa15) - developer mode fails size:M Resolved mkittler

openQA Infrastructure - action #110467: Establish reliable tap setup on ow14 New

action #110515: Command export feature for openqa-clone-job size:M Resolved mkittler

openQA Infrastructure - action #113366: Add three more Prague located OSD workers size:M Resolved osukup

openQA Infrastructure - action #124562: Re-install at least one of the new OSD workers located in Prague New

coordination #109968: [epic] fully containerized openQA on SLE microOS New

action #126938: [spike] Simple load-balancing proxy for select API routes New


Related issues

Related to openQA Project - action #80466: docker: Base the webUI and worker Dockerfiles in Tumbleweed Workable 2020-11-26

Related to openQA Project - action #92893: containers, docker-compose: Ensure that the scheduler can connect to the websockets container size:M Resolved 2021-05-20

Related to openQA Project - action #110497: Minion influxdb data causing unusual download rates size:M Resolved 2022-05-01

Related to openQA Project - coordination #80150: [epic] Scale out openQA: Easier openQA setup Blocked 2020-11-04

Copied to openQA Project - coordination #121723: [saga][epic] Scale out: Future uses of on-premise cloud or public cloud New

History

#1 Updated by okurz over 2 years ago

  • Subject changed from [saga][epic] redundant/load-balancing webui deployments of openQA to [saga][epic] redundant/load-balancing webui deployments of openQA (container on kubernetes)
  • Status changed from Workable to Blocked
  • Assignee set to okurz

I created this saga as apparently the whole "container" feature block was not apparent enough for some stakeholders.

Some subtasks are there, ready to be worked on. Setting to "Blocked" by subtasks.

#2 Updated by okurz over 2 years ago

  • Subject changed from [saga][epic] redundant/load-balancing webui deployments of openQA (container on kubernetes) to [saga][epic] Scale out openQA: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes
  • Description updated (diff)

#3 Updated by okurz over 2 years ago

  • Tracker changed from action to coordination

#4 Updated by ilausuch over 2 years ago

  • Related to action #80466: docker: Base the webUI and worker Dockerfiles in Tumbleweed added

#5 Updated by cdywan over 2 years ago

  • AC4: documentation exists for simple single-instance setups, e.g. "get your openQA tests to run in less than 5 minutes"

This seems oddly phrased considering how the Motivation states:

Single instances of the webui can cause longer downtimes and make upgrades of OS more risky

I would expect that as a side-effect of using kubernetes users decide what resources they want to use.

#6 Updated by okurz over 2 years ago

  • Description updated (diff)

cdywan wrote:

  • AC4: documentation exists for simple single-instance setups, e.g. "get your openQA tests to run in less than 5 minutes"

This seems oddly phrased considering how the Motivation states:

Single instances of the webui can cause longer downtimes and make upgrades of OS more risky

I would expect that as a side-effect of using kubernetes users decide what resources they want to use.

Right. I rephrased it so that it's clear that in the motivation we mean "single, not-load-balanced production instances" in contrast to "personal instances that are fast to set up".

#7 Updated by cdywan about 2 years ago

  • Related to action #89749: containers: Allow to the user to choose the source of images in the docker-compose added

#8 Updated by okurz about 2 years ago

  • Related to deleted (action #89749: containers: Allow to the user to choose the source of images in the docker-compose)

#9 Updated by okurz about 2 years ago

  • Subject changed from [saga][epic] Scale out openQA: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes to [saga][epic] Scale out: Redundant/load-balancing deployments of openQA, easy containers, containers on kubernetes

#10 Updated by mkittler over 1 year ago

  • Related to action #92893: containers, docker-compose: Ensure that the scheduler can connect to the websockets container size:M added

#11 Updated by okurz about 1 year ago

  • Related to action #110497: Minion influxdb data causing unusual download rates size:M added

#12 Updated by okurz 12 months ago

  • Description updated (diff)

#13 Updated by okurz 12 months ago

  • Description updated (diff)

#14 Updated by okurz 6 months ago

  • Copied to coordination #121723: [saga][epic] Scale out: Future uses of on-premise cloud or public cloud added

#15 Updated by okurz about 2 months ago

  • Target version changed from Ready to future

#16 Updated by okurz 18 days ago
