action #132146

closed

coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability

coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo

Support migration of osd VM to PRG2 - 2023-08-29 size:M

Added by okurz 10 months ago. Updated 8 months ago.

Status: Resolved
Priority: High
Assignee:
Target version:
Start date: 2023-06-29
Due date:
% Done: 0%
Estimated time:

Description

Motivation

The openQA webUI VM for osd will move to PRG2, same as for o3. This will be conducted by Eng-Infra. We must support them.

Acceptance criteria

  • AC1: osd is reachable from the new location for SUSE employees
  • AC2: osd multi-machine jobs run successfully on osd after the migration
  • AC3: We can still log in to the machine over ssh
  • AC4: https://monitor.qa.suse.de can still reach and monitor OSD

Suggestions


Related issues 7 (2 open, 5 closed)

Related to openQA Infrastructure - action #134816: [tools] grafana dashboard for `OpenQA Jobs test` partially without any data from OSD migration size:M (Resolved, okurz, 2023-08-30)

Related to openQA Infrastructure - action #134846: Old NFS share mount is keeping processes stuck and openQA workers seem up but do not work on jobs (New, 2023-08-30)

Related to openQA Infrastructure - action #134879: reverse DNS resolution PTR for openqa.oqa.prg2.suse.org. yields "3(NXDOMAIN)" for PRG1 workers (NUE1+PRG2 are fine) size:M (Resolved, okurz, 2023-08-31)

Related to openQA Infrastructure - action #134900: salt states fail to apply due to "Pillar openqa.oqa.prg2.suse.org.key does not exist" (Resolved, nicksinger, 2023-08-31)

Copied from openQA Infrastructure - action #132143: Migration of o3 VM to PRG2 - 2023-07-19 size:M (Resolved, nicksinger, 2023-06-29)

Copied to openQA Project - action #134837: SLE test repo not updated on OSD, cron service was not running since 2023-08-29, fetchneedles not called size:M (Resolved, livdywan)

Copied to QA - action #134888: Ensure no job results are present in the file system for jobs that are no longer in the database (New)
Actions #1

Updated by okurz 10 months ago

  • Copied from action #132143: Migration of o3 VM to PRG2 - 2023-07-19 size:M added
Actions #2

Updated by okurz 10 months ago

  • Subject changed from Support migration of osd VM to PRG2 to Support migration of osd VM to PRG2 - 2023-08-01
Actions #3

Updated by okurz 10 months ago

  • Subject changed from Support migration of osd VM to PRG2 - 2023-08-01 to Support migration of osd VM to PRG2 - 2023-08-01 size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #4

Updated by okurz 10 months ago

I wrote in https://suse.slack.com/archives/C04MDKHQE20/p1690262542766069

@John Ford @Moroni Flores @Matthias Griessmeier (CC @here) It's one week until the planned migration of openqa.suse.de: Given the problems observed in the o3 migration, in particular the very slow syncing of storage volumes, the still present network problems at NUE1+NUE2 (like https://progress.opensuse.org/issues/133127) taking away resources from both the SUSE QE Tools team and Eng-Infra, as well as other problems like https://progress.opensuse.org/issues/133250 (https://sd.suse.com/servicedesk/customer/portal/1/SD-128313), I do not see it as feasible to go forward with the OSD migration on the planned day. Of course any delay will have its impact as well and there are personnel absences which we have to take into account, e.g. me on vacation 2023-08-12 to 2023-08-27. WDYT? My preliminary suggestion for a migration date would be 2023-08-29, also assuming that the OBS migration has sufficiently progressed by that time.

Actions #5

Updated by okurz 10 months ago

  • Status changed from Workable to Feedback
  • Assignee set to okurz

still waiting for confirmation of change of plans or sticking to old plan in https://suse.slack.com/archives/C04MDKHQE20/p1690315843382579?thread_ts=1690262542.766069&cid=C04MDKHQE20

Actions #6

Updated by okurz 10 months ago

  • Subject changed from Support migration of osd VM to PRG2 - 2023-08-01 size:M to Support migration of osd VM to PRG2 - 2023-08-29 size:M

In the weekly DCT migration call we agreed that we do the migration cutover on 2023-08-29. Many days before that we should have access to a new VM with new storage volumes and r/o snapshots of the old ones, and can then copy over content and try to connect workers. We can even use the "connect to multiple webUIs" approach here.

Waiting for mcaj to create an Eng-Infra ticket for the preparation of the new VM.

Actions #7

Updated by okurz 9 months ago

There was a request if we can migrate earlier. My statement in https://suse.slack.com/archives/C04MDKHQE20/p1690810541609959

@John Ford as stated today in the morning, for the next step of the OSD migration we need the VM in prg2. If that can be provided by tomorrow EOB, a cutover of OSD might be possible the week after that

I provided more specific requirements in https://jira.suse.com/browse/ENGINFRA-2524 now:

1. Our latest plan as of 2023-07-26 with mcaj was that we need a copy of the openqa.suse.de VM definition, brought up with new storage devices of equal or bigger size than before plus the most recent available read-only snapshots of the storage volumes from NUE1. As soon as we within LSG QE Tools have ssh access to the system we can sync over the content from the read-only snapshots to the new storage targets dynamically within the running system. This is likely to take multiple days but should be fully under our control.
2. Also please provide access to the hypervisor so that we can control and potentially recover any reboot attempts (same as done for openqa.opensuse.org). The alternative would be extended availability of Eng-Infra members to be able to react quickly to problems.
3. Please coordinate with LSG QE Tools in Slack #dct-migration, not via direct messages to individuals who might be unavailable.

openqa.suse.de VM racktables entry: https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9456

Details for the LSG QE Tools plans regarding the migration in https://progress.opensuse.org/issues/132146 . Details about the previously conducted o3 migration in https://progress.opensuse.org/issues/132143
Actions #8

Updated by okurz 9 months ago

  • Status changed from Feedback to Blocked
Actions #9

Updated by okurz 9 months ago

Clarified https://jira.suse.com/browse/ENGINFRA-2524 with mcaj, ndachev, nsinger, … in https://meet.jit.si/suse_qa_tools . We will meet again 2023-08-08 0900Z in https://meet.jit.si/suse_qa_tools . mcaj+ndachev will set up the VM copy, set up the initial network and provide access to SUSE QE Tools. We will sync over content with rsync from the r/o-snapshots of the old volumes to new, fresh data volumes and prepare a seamless cutover of services by 2023-08-29, i.e. change DNS entries when SUSE QE Tools gives the go-signal. An earlier cutover is unfeasible because the data sync is expected to take multiple days and okurz+nsinger are on vacation in 2023-w33+w34.

So here are suggestions for specific steps that we should do as soon as we have access to new-osd (a consolidated sketch follows the list):

  1. ask mcaj+ndachev for device names of "old r/o-snapshot storage volumes" and "new empty storage volumes", let's say "old-assets" will be vdx and "new-assets" will be vdy for example
  2. Create a screen session so that all the actions run persistently
  3. Then create filesystem on new-assets, e.g. mkfs.xfs /dev/vdy
  4. Then mount old-assets and new-assets, e.g. mkdir /mnt/{old,new}-assets && mount -o ro /dev/vdx /mnt/old-assets && mount /dev/vdy /mnt/new-assets
  5. Sync over initially, e.g. rsync -aHP /mnt/{old,new}-assets/, and monitor
  6. After some hours give estimate how long the sync is going to take and extrapolate for the other volumes
  7. Prepare delta-sync, i.e. everything that will be written to old-osd in the meantime, e.g. rsync -aHP --delete --one-file-system openqa.suse.de:/assets/ /mnt/new-assets/ after the sync from the r/o-snapshot finished
  8. Optional: Get fancy with a remote mount from old-osd over the network to prepare a seamless transition, so that we can already run from new-osd with assets+results from old-osd while the syncing finishes in the background
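
A consolidated sketch of steps 2–7 above; the device names vdx/vdy are the placeholders from step 1, and the mount points and filesystem choice are assumptions for illustration:

# run everything in a persistent screen session (step 2)
screen -S osd-sync
# create a filesystem on the new, empty assets volume (step 3)
mkfs.xfs /dev/vdy
# mount the r/o snapshot and the new volume (step 4)
mkdir -p /mnt/{old,new}-assets
mount -o ro /dev/vdx /mnt/old-assets
mount /dev/vdy /mnt/new-assets
# initial sync from the snapshot, preserving hardlinks, with progress (step 5)
time rsync -aHP /mnt/old-assets/ /mnt/new-assets/
# later: delta-sync everything written to old-osd in the meantime (step 7)
time rsync -aHP --delete --one-file-system openqa.suse.de:/assets/ /mnt/new-assets/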
Actions #10

Updated by okurz 9 months ago

  • Status changed from Blocked to In Progress
  • Assignee changed from okurz to mkittler

Handed over to mkittler.

Actions #11

Updated by okurz 9 months ago

  • Tags changed from infra, osd, prg2, dct migration to infra, osd, prg2, dct migration, mob
Actions #12

Updated by mkittler 9 months ago

Tests I would do to verify whether OSD can reach workers:


EDIT: I've just tested the first two points on the new openqa.oqa.prg2.suse.org (also using ports 20013 and 20023 for HTTP traffic) and it works for worker29.oqa.prg2.suse.org (one of the new workers) as well as openqaworker8.suse.de and openqaworker-arm-2.suse.de (two of the old workers). So the developer mode will work again once we have migrated the VM, and the ping alerts should not fire anymore either.
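
A minimal sketch of reachability checks along these lines, with hosts and ports taken from the comment above; the exact commands are illustrative assumptions, not the ones originally used:

# check that the (new) webUI host can open TCP connections to selected workers
# on the ports used for developer-mode HTTP traffic
for host in worker29.oqa.prg2.suse.org openqaworker8.suse.de openqaworker-arm-2.suse.de; do
    for port in 20013 20023; do
        nc -z -w 5 "$host" "$port" && echo "$host:$port reachable" || echo "$host:$port NOT reachable"
    done
done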

Actions #13

Updated by openqa_review 9 months ago

  • Due date set to 2023-08-23

Setting due date based on mean cycle time of SUSE QE Tools

Actions #14

Updated by mkittler 9 months ago

The sync has been running since yesterday as mentioned in #132146#note-9.

Actions #15

Updated by tinita 9 months ago

  • Due date changed from 2023-08-23 to 2023-09-08
Actions #16

Updated by mkittler 9 months ago

We had forgotten to disable the auto-update, so I had to start the sync again and I'm not sure how far it had gotten. I have now stopped and masked rebootmgr.service and stopped and disabled auto-update.timer to prevent that from happening again.

/etc/fstab was also broken. Martin fixed it (see https://jira.suse.com/browse/ENGINFRA-2524?focusedCommentId=1285152&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-1285152) and also created a backup under /etc/fstab.backup. Perhaps Salt has messed with it so I stopped and disabled salt-master.service and salt-minion.service. I also created a draft for updating our Salt states accordingly: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/941

Actions #17

Updated by mkittler 9 months ago

# time rsync -aHP /mnt/ro/assets/ /assets/
sending incremental file list
file has vanished: "/mnt/ro/assets/factory/repo/SLE-15-SP4-Online-s390x-GM-Media1"
file has vanished: "/mnt/ro/assets/factory/tmp/public/11822757/SLES-15-SP4-x86_64-:30223:erlang-Server-DVD-Incidents@64bit-with-external_testkit.qcow2.CHUNKS/SLES-15-SP4-x86_64-:30223:erlang-Server-DVD-Incidents@64bit-with-external_testkit.qcow2"
rsync: [generator] failed to set times on "/assets/factory/tmp": Read-only file system (30)
factory/tmp/
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1330) [sender=3.2.3]

real    33m35.342s
user    0m11.184s
sys     0m30.274s

The sync for results is still ongoing. The sync for srv has been completed and the no-op doesn't take long. I've just started time rsync -aHP /mnt/ro/space-slow/ /space-slow/ to see how long the no-op for that partition takes.

Actions #18

Updated by mkittler 9 months ago

The no-op for /space-slow has been completed as well:

openqa:~ # time rsync -aHP /mnt/ro/space-slow/ /space-slow/
sending incremental file list
…
real    154m35.528s
user    5m22.017s
sys     12m17.096s

It was not really a no-op so I'm doing that again but I guess we're generally ok here.

That only leaves results which is unfortunately not possible due to:

openqa:~ # l /mnt/ro/results/
ls: cannot access '/mnt/ro/results/': Input/output error

We could not resolve this by re-mounting or restarting the VM. I suppose I'll just continue with:

time rsync -aHP --delete --one-file-system openqa.suse.de:/results/ /results/
Actions #19

Updated by mkittler 9 months ago

/space-slow has just finished as actual no-op without errors:

openqa:~ # time rsync -aHP /mnt/ro/space-slow/ /space-slow/
sending incremental file list

real    99m25.462s
user    2m32.740s
sys     10m37.868s

Considering we have now synced everything from read-only devs (except that some results are missing and the read-only dev is now inaccessible) I'll start syncing from the actual OSD.

Actions #20

Updated by mkittler 9 months ago

I have now invoked

openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/results/ /results/
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/assets/ /assets/
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/space-slow/ /space-slow/

in different screen sessions.

Actions #22

Updated by mkittler 9 months ago

openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/space-slow/ /space-slow/
…
real    190m29.123s
user    2m51.277s
sys     7m2.874s
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/srv/ /srv/
receiving incremental file list
…
real    26m21.517s
user    5m48.051s
sys     4m8.627s
Actions #23

Updated by mkittler 9 months ago

The sync of assets and results triggered as mentioned in #132146#note-20 is still ongoing.

Actions #24

Updated by mkittler 9 months ago

openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/results/ /results/
testresults/11858/11858999-sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29613:python3-qt5-qam_ha_priority_fencing_node02@64bit/ulogs/
webui/
webui/cache/
webui/cache/asset-status.json
     21,691,665 100%   14.19MB/s    0:00:01 (xfr#7083559, to-chk=0/303460578)

real    3286m46.492s
user    94m51.405s
sys     160m53.979s
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/assets/ /assets/
tests/vmdp/.git/index
         19,081 100%   18.20MB/s    0:00:00 (xfr#320294, ir-chk=1033/2881761)
tests/vmdp/.git/objects/
rsync warning: some files vanished before they could be transferred (code 24) at main.c(1835) [generator=3.2.3]

real    2103m32.269s
user    234m26.900s
sys     176m52.515s
Actions #25

Updated by okurz 9 months ago

3286m46.492s means about 55 h, so roughly 2.3 days. But that was not the no-op yet, was it? If it were, that would be acceptable for a weekend, but I think we can also find better solutions, like switching OSD to read-only, only syncing srv with the database, switching on the new osd with the in-sync database and then slowly syncing the missing results over while the new osd is running.

Actions #26

Updated by mkittler 9 months ago

No, that's not no-op yet. Of course it'll never really be no-op as long as the old VM is still producing/changing data. So switching the old VM to read-only like you've mentioned would be helpful. I think we can decide on that when you're back.

Those are now the figures for the next round:

openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/results/ /results/
…
testresults/11888/11888999-sle-15-SP5-Server-DVD-Updates-x86_64-Build20230820-1-fips_tests_crypt_krb5_client@64bit/ulogs/
webui/
webui/cache/
webui/cache/asset-status.json
     20,678,271 100%   10.08MB/s    0:00:01 (xfr#4206743, to-chk=0/298066900)

real    2353m9.519s
user    71m29.552s
sys     113m48.671s
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/assets/ /assets/
tests/vmdp/.git/objects/
rsync warning: some files vanished before they could be transferred (code 24) at main.c(1835) [generator=3.2.3]

real    1643m17.639s
user    224m10.813s
sys     138m39.520s

So it went down from roughly 2.3 d to 1.63 d. I'll start another round for results and assets because it is presumably a good idea to keep catching up.

Actions #27

Updated by mkittler 9 months ago

openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/assets/ /assets/
…
real    1194m46.700s
user    144m7.061s
sys     90m15.923s

The last sync of results is still running.

Actions #28

Updated by okurz 8 months ago

I logged into the VM openqa.oqa.prg2.suse.org, switched to root, attached to the screen session, named the individual screen shells (e.g. rsync-assets) and restarted the syncs for results, assets, srv and space-slow.

As the next step I set the limit max_running_jobs in openqa.suse.de:/etc/openqa/openqa.ini to 140.
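
For reference, the corresponding openqa.ini snippet presumably looks as follows; the [scheduler] section name is an assumption based on openQA's documented configuration, not quoted from this comment:

# /etc/openqa/openqa.ini on old-osd (assumed location of the setting)
[scheduler]
max_running_jobs = 140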

EDIT: I also did on old-osd systemctl edit openqa-scheduler and added

[Service]
…
Environment="MAX_JOB_ALLOCATION=160"

We added a "migration announcement" notice on the index page of both old-osd and new-osd.

I stopped and masked salt-minion and salt-master on old-osd with sudo systemctl mask --now salt-minion salt-master, disabled https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules and added corresponding steps to "Suggestions" in the ticket description.

I setup a recurring sync of space-slow with

while date -Is && sleep 1200; do time rsync -aHP --delete --one-file-system openqa.suse.de:/space-slow/ /space-slow/; done

and accordingly for srv, results, assets. For results as "images/" and "testresults/" are very large I split this up into

while date -Is && sleep 1200; do time rsync -aHP --delete --one-file-system openqa.suse.de:/results/ /results/ --exclude=images/ --exclude=testresults/ && for i in images testresults; do time rsync -aHP --delete --one-file-system openqa.suse.de:/results/$i/ /results/$i/; done; done

Tomorrow morning we should switch old-osd to read-only, sync over /srv, enable the /srv bind mount on new-osd, start the database and carefully check the database+webUI. Then trigger the sync of assets+results+space-slow. As soon as everything is in sync, switch over DNS, ensure the webUI is up and then ensure workers connect. Alternatively, enable full services again with incomplete assets+results and sync over assets+results without --delete while the webUI is already running from the new location.
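
A rough sketch of that cutover sequence; the service names and the bind-mount target are assumptions for illustration, the actual steps were performed and verified manually:

# on old-osd: stop job-producing services, webUI stays up read-only (see the apache config in #note-30 below)
ssh openqa.suse.de 'sudo systemctl stop openqa-scheduler openqa-websockets openqa-gru'
# on new-osd: final sync of /srv (contains the database), then bring up services
time rsync -aHP --delete --one-file-system openqa.suse.de:/srv/ /srv/
mount --bind /srv/PSQL /var/lib/pgsql   # hypothetical bind-mount target
systemctl start postgresql openqa-webui openqa-scheduler openqa-websockets openqa-gru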

  1. @nicksinger prepare the apache config changes to only allow GET requests
  2. @nicksinger prepare a merge request for DNS switch-over change in OPS-Service
Actions #29

Updated by okurz 8 months ago

  • Description updated (diff)
Actions #30

Updated by nicksinger 8 months ago

Adding the following to our apache2 config (/etc/apache2/vhosts.d/openqa-common.inc) should switch old OSD into read-only mode:

<Location />
    <LimitExcept GET>
       Require all denied
    </LimitExcept>
</Location>
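
A quick way one could verify the read-only mode after reloading apache; this check is illustrative, and the endpoint is just an example:

# GET should still be served, any other method should be denied by apache
curl -s -o /dev/null -w 'GET:  %{http_code}\n' https://openqa.suse.de/api/v1/jobs
curl -s -o /dev/null -w 'POST: %{http_code}\n' -X POST https://openqa.suse.de/api/v1/jobs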
Actions #32

Updated by mkittler 8 months ago

  • Description updated (diff)
Actions #34

Updated by mkittler 8 months ago

  • Description updated (diff)
Actions #36

Updated by okurz 8 months ago

We realized that we have static IP entries on salt-controlled hosts; removing those with

for i in $(ssh osd-new "sudo salt-key -L" | grep -v ':$'); do ssh -o StrictHostKeyChecking=no $i "sudo sed -i -e '/2620:113:80c0:8080:10:160:0:207/d' -e 's/10.160.0.207/10.145.10.207/' /etc/hosts"; done
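
A hypothetical follow-up check, mirroring the loop above, to confirm the old address 10.160.0.207 is gone everywhere:

for i in $(ssh osd-new "sudo salt-key -L" | grep -v ':$'); do
    echo -n "$i: " && ssh -o StrictHostKeyChecking=no "$i" "grep -c 10.160.0.207 /etc/hosts || true"
done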

By the way, the announcement message on the index page should be

🚨 We're currently migrating this service - Please also see the announcement for more details. This is https://openqa.suse.de in the old location. This server is now in read-only mode

and respectively for the new location as well.

Actions #37

Updated by tinita 8 months ago

tinita wrote in #note-35:

Regarding rabbitmq

Publishing works again after Lazaros made some firewall adjustments

Actions #38

Updated by tinita 8 months ago

  • Description updated (diff)
Actions #39

Updated by nicksinger 8 months ago

The salt-states deployment pipeline could not access the new OSD via ssh. I wrote to Lazaros in Slack, who helped to unblock this; pipelines can now access the new OSD.
We now also face an issue similar to the one described in https://progress.opensuse.org/issues/134522, which requires https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/587 and adjustments to the nue.suse.com zone, which does not seem possible in the ops-salt repo. I asked in Slack how we can do this: https://suse.slack.com/archives/C04MDKHQE20/p1693310485167729

Actions #40

Updated by mkittler 8 months ago

Looks like the static IP entry is now fixed on all hosts. I'm restarting salt-minion via for i in $( ssh openqa.suse.de 'sudo salt-key --list=accepted | tail -n +2' ) ; do echo "host $i" && ssh -4 -o StrictHostKeyChecking=no "$i" 'sudo systemctl restart salt-minion' ; done because otherwise it apparently doesn't use the updated entry.

Actions #41

Updated by nicksinger 8 months ago

  • Description updated (diff)
Actions #42

Updated by mkittler 8 months ago

  • Description updated (diff)
Actions #43

Updated by mkittler 8 months ago

  • Description updated (diff)
Actions #44

Updated by mkittler 8 months ago

  • Description updated (diff)
Actions #45

Updated by mkittler 8 months ago

MR to fix a broken Grafana panel and alert after the migration: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/950

Actions #46

Updated by okurz 8 months ago

https://openqa.suse.de/tests/11930657 says "Reason: api failure: 400 response: mkdir /var/lib/openqa/testresults/11930/11930657-sle-15-SP4-EC2-BYOS-Updates-x86_64-Build20230828-1-publiccloud_containers@64bit: Permission denied at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 84. ". This looks like an issue on OSD itself. Running chown -R geekotest.nogroup 119* on OSD now. Unclear what caused this. Running host=openqa.suse.de openqa-advanced-retrigger-jobs to retrigger incompletes.

Announcement sent over email and Slack about completion of the core part of the migration:

We are happy to report that https://openqa.suse.de with its alias https://openqa.nue.suse.com has successfully been migrated from the NUE1 to the PRG2 datacenter. The instance is eagerly executing test jobs from its new location. We will continue to monitor the system even more closely than usual in the next days. Some background tasks and cleanup are still being conducted. If you find any potentially related issues please report them as usual. Enjoy :)

Actions #47

Updated by okurz 8 months ago

  • Description updated (diff)

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/950 merged. Description updated with resolved tasks.

Actions #48

Updated by okurz 8 months ago

  • Description updated (diff)
Actions #49

Updated by okurz 8 months ago

The results sync finished. I will not start it again and trust that it is complete now. I also checked permissions multiple times in /var/lib/openqa/testresults and did not encounter the former "root.root" ownership problem again.
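
A check along these lines could confirm that no top-level testresult directories remain with wrong ownership; this is illustrative, with the geekotest user taken from the chown in #note-46:

# list any top-level testresult directories not owned by geekotest
find /var/lib/openqa/testresults -maxdepth 1 ! -user geekotest -ls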

Actions #50

Updated by okurz 8 months ago

  • Related to action #134816: [tools] grafana dashboard for `OpenQA Jobs test` partially without any data from OSD migration size:M added
Actions #51

Updated by okurz 8 months ago

reverse DNS entry seems to be invalid, should be fixed by https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3935/

Actions #52

Updated by okurz 8 months ago

  • Copied to action #134837: SLE test repo not updated on OSD, cron service was not running since 2023-08-29, fetchneedles not called size:M added
Actions #53

Updated by okurz 8 months ago

Many workers were unresponsive because openQA worker processes were stuck trying to read the old, stale NFS share from old-OSD. E.g.

# systemctl status openqa-worker-auto-restart@16
● openqa-worker-auto-restart@16.service - openQA Worker #16
     Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
             └─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
     Active: active (running) since Wed 2023-08-30 11:20:37 CEST; 3h 31min ago
    Process: 51156 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/16 (code=exited, status=0/SUCCESS)
   Main PID: 51157 (worker)
      Tasks: 1 (limit: 12287)
     CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@16.service
             └─ 51157 /usr/bin/perl /usr/share/openqa/script/worker --instance 16

Aug 30 11:20:38 openqaworker14 worker[51157]: [info] [pid:51157] worker 16:
Aug 30 11:20:38 openqaworker14 worker[51157]:  - config file:                      /etc/openqa/workers.ini
Aug 30 11:20:38 openqaworker14 worker[51157]:  - name used to register:            openqaworker14
Aug 30 11:20:38 openqaworker14 worker[51157]:  - worker address (WORKER_HOSTNAME): openqaworker14.qa.suse.cz
Aug 30 11:20:38 openqaworker14 worker[51157]:  - isotovideo version:               40
Aug 30 11:20:38 openqaworker14 worker[51157]:  - websocket API version:            1
Aug 30 11:20:38 openqaworker14 worker[51157]:  - web UI hosts:                     openqa.suse.de
Aug 30 11:20:38 openqaworker14 worker[51157]:  - class:                            qemu_x86_64,qemu_x86_64_staging,qemu_x86_64-large-mem,windows11,wsl2,platform_intel,prg,prg_office,openqaworker14
Aug 30 11:20:38 openqaworker14 worker[51157]:  - no cleanup:                       no
Aug 30 11:20:38 openqaworker14 worker[51157]:  - pool directory:                   /var/lib/openqa/pool/16

The same happened on other machines; the reason is a stuck process, see

$ ssh openqaworker17.qa.suse.cz 
Have a lot of fun...
okurz@openqaworker17:~> ps auxf | grep '\<D\>'
root      2775  0.0  0.0  36596  8264 ?        Ss   Aug27   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
okurz    64738  0.0  0.0   8208   772 pts/0    S+   14:55   0:00              \_ grep --color=auto \<D\>
root     34646  0.0  0.0   5520   728 ?        D    10:59   0:00      \_ df

the "df" call is by the openQA worker cache service I think.

That is due to the NFS share on /var/lib/openqa/share.

Now we called

salt \* cmd.run 'ps -eo pid,stat | grep " D\>" && reboot'

to reboot all machines with stuck processes. Created #134846 as an improvement ticket.

EDIT: https://openqa.suse.de/tests currently shows 340 concurrently running jobs, limited by the server config. So far so good.
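
The number of currently running jobs can presumably also be queried via the API, e.g. (assuming jq is available):

curl -s 'https://openqa.suse.de/api/v1/jobs?state=running' | jq '.jobs | length'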

Actions #54

Updated by mkittler 8 months ago

To actually fix the host alert: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/955

EDIT: The MR has been merged and now the query in the alert is correct (see https://stats.openqa-monitor.qa.suse.de/alerting/grafana/tm0h5mf4k/view).

Actions #55

Updated by mkittler 8 months ago

To fix the remaining alerts we have to either remove the IPv6 AAAA records in the DNS config, fix IPv6, or explicitly use only IPv4 in our monitoring. Possibly we can also apply the same workaround as we already have on o3 (#133403#note-5).

Actions #56

Updated by okurz 8 months ago

  • Related to action #134846: Old NFS share mount is keeping processes stuck and openQA workers seem up but do not work on jobs added
Actions #57

Updated by mkittler 8 months ago

  • Description updated (diff)

The backup of the new OSD VM on the backup VM works, e.g. /home/rsnapshot/alpha.0/openqa.suse.de/etc/fstab is current.

I've also just disabled the root login on OSD.

Those were the only items from o3's list that apply here and haven't already been done or mentioned anyway.
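
Disabling the root login presumably amounts to the usual sshd setting; a sketch of the kind of change meant here, assuming it was done via sshd_config:

# /etc/ssh/sshd_config (or a drop-in below /etc/ssh/sshd_config.d/)
PermitRootLogin no
# then reload the daemon
systemctl reload sshd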

Actions #58

Updated by mkittler 8 months ago

Looks like many workers are not connected. I suppose the following can be ignored:

  • worker11.oqa.suse.de: not reachable via SSH, no output visible via SOL, not accepted via salt anyways
  • worker12.oqa.suse.de: reachable but no slots online, not accepted via salt anyways
  • openqaworker6: moved to o3, see #129484
  • openqaworker-arm-1: moved to NUE2, not fully back yet
  • grenache-1: deracked for now

This leaves:

  • openqaworker9: just slot 12 and 15
  • worker40: shown as offline despite services running
  • broken slots on sapworker1, worker2 and a few of the worker3x ones
Actions #59

Updated by mkittler 8 months ago

The openqaworker9 slots are actually just leftovers which I've removed.

The broken slots are just a displaying error. When waiting a short while they show up as idle and then some other set of slots shows up as broken. I think this affects slots that have just finished a job. I guess this is something for another ticket (the graceful disconnect feature likely needs to be reworked).

The issue with worker40 was caused by a hanging NFS mount, and our previous rebooting didn't take this worker into account. So I've just rebooted it now. The workers don't hang anymore but are still unable to connect:

Aug 30 18:03:03 worker40 worker[18479]: [warn] [pid:18479] Failed to register at openqa.suse.de - connection error: Connection refused - trying again in 10 seconds

So only worker40 is still problematic. I'll look into it tomorrow.

Actions #60

Updated by okurz 8 months ago

  • Related to action #134879: reverse DNS resolution PTR for openqa.oqa.prg2.suse.org. yields "3(NXDOMAIN)" for PRG1 workers (NUE1+PRG2 are fine) size:M added
Actions #61

Updated by okurz 8 months ago

  • Copied to action #134888: Ensure no job results are present in the file system for jobs that are no longer in the database added
Actions #62

Updated by okurz 8 months ago

  • Description updated (diff)

mkittler wrote in #note-59:

So only worker40 is still problematic. I'll look into it tomorrow.

That is better done in #132137 though. I split out AC5 into action #134888: Ensure no job results are present in the file system for jobs that are no longer in the database. With that, this only leaves https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit, which I have now enabled and triggered as https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/784218

Actions #63

Updated by okurz 8 months ago

  • Related to action #134900: salt states fail to apply due to "Pillar openqa.oqa.prg2.suse.org.key does not exist" added
Actions #64

Updated by okurz 8 months ago

  • Description updated (diff)
Actions #65

Updated by mkittler 8 months ago

Yes, since only worker40 was remaining (which is one of the new Prague-located workers) it could have been done as part of #132137. However, I was of course also checking all the other workers. I was able to fix the connectivity issues with worker40 now anyway. It was not in salt because I only added it after we set up the new VM and the change hadn't been carried over. That also means it still had the old static IP entry in /etc/hosts, which prevented it from even showing up as an unaccepted host in salt-key (so it wasn't noticed).

By the way, I'm not sure whether the workers showing as broken with "graceful disconnect" are really just displaying errors. I think the problem is that these workers have registered themselves via the API but are still waiting for the websocket server¹ to respond. So they're basically still in the process of establishing a full connection. That state is not handled well by our displaying code, but it is also problematic on its own as it means that the websocket server is overloaded. I suppose I should create a ticket for that.

¹ e.g. worker slots stay for a very long time at

Aug 31 12:11:55 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:11:56 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:16:56 worker40 worker[122368]: [warn] [pid:122368] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3108, code 502 - trying again in 10 seconds
Aug 31 12:17:06 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:17:10 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108

without the Registered and connected via websockets … line showing up yet.

Maybe I should create a separate ticket for that.

EDIT: It looks like this isn't even working at all or at least some timeout has been hit:

Aug 31 12:11:55 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:11:56 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:16:56 worker40 worker[122368]: [warn] [pid:122368] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3108, code 502 - trying again in 10 seconds
Aug 31 12:17:06 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:17:10 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:22:10 worker40 worker[122368]: [warn] [pid:122368] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3108, code 502 - trying again in 10 seconds
Aug 31 12:22:20 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:27:09 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:27:09 worker40 worker[122368]: [info] [pid:122368] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3108

Considering that the timestamps are exactly 5 minutes apart (most likely the gateway timeout for the websocket connection) and that it worked on the next attempt, it is likely just the websocket server being severely overloaded.

EDIT: I've created a ticket for that #134924.

Actions #66

Updated by okurz 8 months ago

  • Due date deleted (2023-09-08)
  • Status changed from In Progress to Resolved

Then we are good
