action #132146 (closed)
coordination #121720: [saga][epic] Migration to QE setup in PRG2+NUE3 while ensuring availability
coordination #123800: [epic] Provide SUSE QE Tools services running in PRG2 aka. Prg CoLo
Support migration of osd VM to PRG2 - 2023-08-29 size:M
Description
Motivation
The openQA webUI VM for osd will move to PRG2, same as for o3. This will be conducted by Eng-Infra. We must support them.
Acceptance criteria
- AC1: osd is reachable from the new location for SUSE employees
- AC2: osd multi-machine jobs run successfully on osd after the migration
- AC3: We can still log in to the machine over ssh
- AC4: https://monitor.qa.suse.de can still reach and monitor OSD
Suggestions
- DONE Inform affected users about planned migration on date 2023-08-29
- DONE Track https://jira.suse.com/browse/ENGINFRA-1742 "Build OpenQA Environment" for story of the openQA VMs being migrated
- DONE Ensure that we can still log in to the machine over ssh
- DONE Ensure that both https://openqa.suse.de as well as https://openqa.nue.suse.com work
- DONE (supposedly nothing to be changed) Update https://wiki.suse.net/index.php/OpenQA where necessary
- DONE Enable salt-minion and salt-master on new-osd again
- DONE Ensure that events to rabbit.suse.de can be published (look for errors in the openqa-webui.service journal)
- DONE During migration work closely with Eng-Infra members conducting the actual VM migration
- DONE Ensure the openqa.nue.suse.com DNS record points to the new IP and is included in the certificate generation at https://gitlab.suse.de/openqa/salt-pillars-openqa/-/blob/master/certificates/hosts.sls#L8 again (it was removed temporarily as a workaround by @nicksinger)
- DONE Learn from #132143 what to look out for regarding OSD migration
- DONE Ensure that osd is reachable again after migration from the new location
  - DONE for SUSE employees
  - for osd workers from all locations, e.g. PRG2, NUE1-SRV1, NUE1-SRV2, FC Basement (some workers still show up as offline but probably should be online)
- DONE (done as only pinging via IPv6 from OSD to other hosts does not work) Ensure that https://monitor.qa.suse.de can still reach and monitor OSD
- DONE Inform users as soon as migration is complete
- DONE Enable https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit again and ensure it's working on new-osd
Updated by okurz over 1 year ago
- Copied from action #132143: Migration of o3 VM to PRG2 - 2023-07-19 size:M added
Updated by okurz over 1 year ago
- Subject changed from Support migration of osd VM to PRG2 to Support migration of osd VM to PRG2 - 2023-08-01
Updated by okurz over 1 year ago
- Subject changed from Support migration of osd VM to PRG2 - 2023-08-01 to Support migration of osd VM to PRG2 - 2023-08-01 size:M
- Description updated (diff)
- Status changed from New to Workable
Updated by okurz over 1 year ago
I wrote in https://suse.slack.com/archives/C04MDKHQE20/p1690262542766069
@John Ford @Moroni Flores @Matthias Griessmeier (CC @here) It's one week until the planned migration of openqa.suse.de: Given the problems observed in the o3 migration, in particular the very slow syncing of storage volumes, the still present network problems at NUE1+NUE2 (like https://progress.opensuse.org/issues/133127) taking away resources from both the SUSE QE Tools team and Eng-Infra, as well as other problems like https://progress.opensuse.org/issues/133250 (https://sd.suse.com/servicedesk/customer/portal/1/SD-128313), I do not see it feasible to go forward with the OSD migration on the planned day. Of course any delay will have its impact as well and there are personnel absences which we have to take into account, e.g. me on vacation 2023-08-12 to 2023-08-27. WDYT? My preliminary suggestion for a migration date would be 2023-08-29, also assuming that the OBS migration has sufficiently progressed at that time
Updated by okurz over 1 year ago
- Status changed from Workable to Feedback
- Assignee set to okurz
Still waiting for confirmation whether we change plans or stick to the old plan, see https://suse.slack.com/archives/C04MDKHQE20/p1690315843382579?thread_ts=1690262542.766069&cid=C04MDKHQE20
Updated by okurz over 1 year ago
- Subject changed from Support migration of osd VM to PRG2 - 2023-08-01 size:M to Support migration of osd VM to PRG2 - 2023-08-29 size:M
In the weekly DCT migration call we agreed that we do the migration cutover on 2023-08-29. Well before that we should have access to a new VM with new storage volumes and r/o snapshots of the old ones, then copy over the content and try to connect workers. We can even use the "connect to multiple webUIs" approach here (a sketch of the worker config follows below).
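For illustration, a minimal sketch of what the "connect to multiple webUIs" approach could look like in a worker's /etc/openqa/workers.ini; the host names are assumptions here (the new VM name openqa.oqa.prg2.suse.org is taken from later comments):
# sketch only: worker registers against both the old and the new webUI during the transition
[global]
HOST = https://openqa.suse.de https://openqa.oqa.prg2.suse.org
# each listed host additionally needs API credentials in /etc/openqa/client.conf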
Waiting for mcaj to create an Eng-Infra ticket for the preparation of the new VM.
Updated by okurz over 1 year ago
There was a request asking whether we could migrate earlier. My statement in https://suse.slack.com/archives/C04MDKHQE20/p1690810541609959
@John Ford as stated today in the morning, for the next step of the OSD migration we need the VM in prg2. If that can be provided by tomorrow EOB, a cutover of OSD might be possible the week after that
I provided more specific requirements in https://jira.suse.com/browse/ENGINFRA-2524 now:
1. Our latest plan as of 2023-07-26 with mcaj was that we need a copy of the openqa.suse.de VM machine definition, brought up with new storage devices of equal or bigger size than before, plus the most recent available read-only snapshots of the storage volumes from NUE1. As soon as we within LSG QE Tools have ssh access to the system we can sync the content over from the read-only snapshots to the new storage targets dynamically within the running system. This is likely to take multiple days but should be fully under our control.
2. Also please provide access to the hypervisor so that we can control and potentially recover any reboot attempts (same as done for openqa.opensuse.org). The alternative would be a necessary extended availability of Eng-Infra members to be able to react quickly to problems.
3. Please coordinate with LSG QE Tools in Slack #dct-migration, not via direct messages to individuals who might be unavailable.
openqa.suse.de VM racktables entry: https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=9456
Details for the LSG QE Tools plans regarding the migration in https://progress.opensuse.org/issues/132146 . Details about the previously conducted o3 migration in https://progress.opensuse.org/issues/132143
Updated by okurz over 1 year ago
- Status changed from Feedback to Blocked
Updated by okurz over 1 year ago
Clarified https://jira.suse.com/browse/ENGINFRA-2524 with mcaj, ndachev, nsinger, … in https://meet.jit.si/suse_qa_tools . We will meet again 2023-08-08 0900Z in https://meet.jit.si/suse_qa_tools . mcaj+ndachev will set up the VM copy, set up the initial network and provide access to SUSE QE Tools. We will sync over content with rsync from the r/o snapshots of the old volumes to fresh new data volumes and prepare a seamless cutover of services by 2023-08-29, i.e. change the DNS entries when SUSE QE Tools gives the go-signal. An earlier cutover is unfeasible because the data sync is expected to take multiple days and okurz+nsinger are on vacation in 2023-w33+w34.
So suggestions for specific steps that we should do as soon as we have access to new-osd:
- ask mcaj+ndachev for device names of "old r/o-snapshot storage volumes" and "new empty storage volumes", let's say "old-assets" will be vdx and "new-assets" will be vdy for example
- Create a screen session so that all the actions run persistently
- Then create filesystem on new-assets, e.g.
mkfs.xfs /dev/vdy
- Then mount old-assets and new-assets, e.g.
mkdir /mnt/{old,new}-assets && mount -o ro /dev/vdx /mnt/old-assets && mount /dev/vdy /mnt/new-assets
- Sync over initially, e.g.
rsync -aHP /mnt/{old,new}-assets/
, and monitor
- After some hours, give an estimate of how long the sync is going to take and extrapolate for the other volumes
- Prepare delta-sync, i.e. everything that will be written to old-osd in the meantime, e.g.
rsync -aHP --delete --one-file-system openqa.suse.de:/assets/ /mnt/new-assets/
after the sync from the r/o-snapshot finished
- Optional: Get fancy with a remote mount from old-osd over the network to prepare a seamless transition, so that we can already run from new-osd with assets+results from old-osd while the syncing finishes in the background (a possible sketch follows below)
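A hypothetical way to do such a remote mount (not what was actually used; assumes sshfs is installed on new-osd and root ssh access to old-osd):
# serve assets read-only from old-osd so new-osd could already operate
# while the background rsync keeps filling the local /assets
mkdir -p /mnt/old-osd-assets
sshfs -o ro,reconnect openqa.suse.de:/assets /mnt/old-osd-assets
# …and later, once the local sync has caught up:
umount /mnt/old-osd-assets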
Updated by okurz over 1 year ago
- Status changed from Blocked to In Progress
- Assignee changed from okurz to mkittler
Handed over to mkittler.
Updated by okurz over 1 year ago
- Tags changed from infra, osd, prg2, dct migration to infra, osd, prg2, dct migration, mob
Updated by mkittler over 1 year ago
Tests I would do to verify whether OSD can reach workers:
- ssh -4 worker29.oqa.prg2.suse.org
- curl worker29.oqa.prg2.suse.org:8000 after running python3 -m http.server on the worker. Likely it makes sense to also test some other ports, e.g. from the range mentioned in the "command server" box on https://open.qa/docs/images/architecture.svg.
- Check alerts like https://stats.openqa-monitor.qa.suse.de/alerting/grafana/host_up_alert_worker33/view
EDIT: I've just tested the first two points on the new openqa.oqa.prg2.suse.org (also using ports 20013 and 20023 for HTTP traffic) and it works for worker29.oqa.prg2.suse.org (one of the new workers) as well as openqaworker8.suse.de and openqaworker-arm-2.suse.de (two of the old workers). So the developer mode will work again once we have migrated the VM, and possible ping alerts won't fire anymore either.
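For reference, such a check could look like this (a sketch; host and ports are just the ones mentioned above):
# on the worker: start simple HTTP listeners on the ports to be tested
for p in 8000 20013 20023; do python3 -m http.server "$p" & done
# on OSD: probe each port, forcing IPv4
for p in 8000 20013 20023; do curl -4 --connect-timeout 5 -s -o /dev/null "http://worker29.oqa.prg2.suse.org:$p" && echo "port $p reachable"; done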
Updated by openqa_review over 1 year ago
- Due date set to 2023-08-23
Setting due date based on mean cycle time of SUSE QE Tools
Updated by mkittler over 1 year ago
The sync is running since yesterday as mentioned in #132146#note-9.
Updated by tinita over 1 year ago
- Due date changed from 2023-08-23 to 2023-09-08
Updated by mkittler over 1 year ago
We had forgotten to disable the auto-update, so I had to start the sync again and I'm not sure how far it had come. I have now stopped and masked rebootmgr.service and stopped and disabled auto-update.timer to prevent that from happening again.
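Presumably that boils down to something like the following (a sketch, not the literal command history):
# stop and mask the reboot manager so automatic updates cannot reboot the VM mid-sync
systemctl stop rebootmgr.service
systemctl mask rebootmgr.service
# stop and disable the automatic update timer
systemctl disable --now auto-update.timer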
/etc/fstab was also broken. Martin fixed it (see https://jira.suse.com/browse/ENGINFRA-2524?focusedCommentId=1285152&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-1285152) and also created a backup under /etc/fstab.backup. Perhaps Salt has messed with it, so I stopped and disabled salt-master.service and salt-minion.service. I also created a draft for updating our Salt states accordingly: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/941
Updated by mkittler over 1 year ago
# time rsync -aHP /mnt/ro/assets/ /assets/
sending incremental file list
file has vanished: "/mnt/ro/assets/factory/repo/SLE-15-SP4-Online-s390x-GM-Media1"
file has vanished: "/mnt/ro/assets/factory/tmp/public/11822757/SLES-15-SP4-x86_64-:30223:erlang-Server-DVD-Incidents@64bit-with-external_testkit.qcow2.CHUNKS/SLES-15-SP4-x86_64-:30223:erlang-Server-DVD-Incidents@64bit-with-external_testkit.qcow2"
rsync: [generator] failed to set times on "/assets/factory/tmp": Read-only file system (30)
factory/tmp/
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1330) [sender=3.2.3]
real 33m35.342s
user 0m11.184s
sys 0m30.274s
The sync for results is still ongoing. The sync for srv has been completed and the no-op doesn't take long. I've just started
time rsync -aHP /mnt/ro/space-slow/ /space-slow/
to see how long the no-op for that partition takes.
Updated by mkittler over 1 year ago
The no-op for /space-slow has been completed as well:
openqa:~ # time rsync -aHP /mnt/ro/space-slow/ /space-slow/
sending incremental file list
…
real 154m35.528s
user 5m22.017s
sys 12m17.096s
It was not really a no-op so I'm doing that again but I guess we're generally ok here.
That only leaves results, which is unfortunately not possible due to:
openqa:~ # l /mnt/ro/results/
ls: cannot access '/mnt/ro/results/': Input/output error
We could not resolve this by re-mounting or restarting the VM. I suppose I'll just continue with:
time rsync -aHP --delete --one-file-system openqa.suse.de:/results/ /results/
Updated by mkittler over 1 year ago
/space-slow has just finished as an actual no-op without errors:
openqa:~ # time rsync -aHP /mnt/ro/space-slow/ /space-slow/
sending incremental file list
real 99m25.462s
user 2m32.740s
sys 10m37.868s
Considering we have now synced everything from read-only devs (except that some results are missing and the read-only dev is now inaccessible) I'll start syncing from the actual OSD.
Updated by mkittler over 1 year ago
I have now invoked
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/results/ /results/
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/assets/ /assets/
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/space-slow/ /space-slow/
in different screen sessions.
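For reference, detached and named screen sessions for such long-running syncs could be created like this (illustrative only; session names are arbitrary):
screen -S rsync-results -dm bash -c 'time rsync -aHP --delete --one-file-system openqa.suse.de:/results/ /results/'
screen -S rsync-assets -dm bash -c 'time rsync -aHP --delete --one-file-system openqa.suse.de:/assets/ /assets/'
screen -S rsync-space-slow -dm bash -c 'time rsync -aHP --delete --one-file-system openqa.suse.de:/space-slow/ /space-slow/'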
Updated by mkittler over 1 year ago
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/space-slow/ /space-slow/
…
real 190m29.123s
user 2m51.277s
sys 7m2.874s
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/srv/ /srv/
receiving incremental file list
…
real 26m21.517s
user 5m48.051s
sys 4m8.627s
Updated by mkittler over 1 year ago
The sync of assets and results triggered as mentioned in #132146#note-20 is still ongoing.
Updated by mkittler over 1 year ago
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/results/ /results/
testresults/11858/11858999-sle-15-SP5-Server-DVD-HA-Incidents-x86_64-Build:29613:python3-qt5-qam_ha_priority_fencing_node02@64bit/ulogs/
webui/
webui/cache/
webui/cache/asset-status.json
21,691,665 100% 14.19MB/s 0:00:01 (xfr#7083559, to-chk=0/303460578)
real 3286m46.492s
user 94m51.405s
sys 160m53.979s
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/assets/ /assets/
tests/vmdp/.git/index
19,081 100% 18.20MB/s 0:00:00 (xfr#320294, ir-chk=1033/2881761)
tests/vmdp/.git/objects/
rsync warning: some files vanished before they could be transferred (code 24) at main.c(1835) [generator=3.2.3]
real 2103m32.269s
user 234m26.900s
sys 176m52.515s
Updated by okurz over 1 year ago
3286m46.492s means about 55 h, so roughly 2.5 d. But that was not the no-op yet, was it? If it were, that would be acceptable for a weekend, but I think we can also find better solutions, like switching OSD to read-only, only syncing srv with the database, switching on new-osd with the in-sync database and then slowly syncing the missing results over while new-osd is running.
Updated by mkittler over 1 year ago
No, that's not a no-op yet. Of course it'll never really be a no-op as long as the old VM is still producing/changing data. So switching the old VM to read-only as you've mentioned would be helpful. I think we can decide on that when you're back.
Those are now the figures for the next round:
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/results/ /results/
…
testresults/11888/11888999-sle-15-SP5-Server-DVD-Updates-x86_64-Build20230820-1-fips_tests_crypt_krb5_client@64bit/ulogs/
webui/
webui/cache/
webui/cache/asset-status.json
20,678,271 100% 10.08MB/s 0:00:01 (xfr#4206743, to-chk=0/298066900)
real 2353m9.519s
user 71m29.552s
sys 113m48.671s
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/assets/ /assets/
tests/vmdp/.git/objects/
rsync warning: some files vanished before they could be transferred (code 24) at main.c(1835) [generator=3.2.3]
real 1643m17.639s
user 224m10.813s
sys 138m39.520s
So it went down from 2.5 d to 1.634 d. I'll start another round for results and assets because it is supposedly a good idea to catch up.
Updated by mkittler over 1 year ago
openqa:~ # time rsync -aHP --delete --one-file-system openqa.suse.de:/assets/ /assets/
…
real 1194m46.700s
user 144m7.061s
sys 90m15.923s
The last sync of results is still running.
Updated by okurz over 1 year ago
I logged into the VM openqa.oqa.prg2.suse.org, switched to root, attached to the screen session, named the individual screen shells, e.g. rsync-assets, and restarted the syncs for results, assets, srv and space-slow.
I now set the limit max_running_jobs in openqa.suse.de:/etc/openqa/openqa.ini to 140 as the next step.
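That presumably corresponds to a snippet like this in /etc/openqa/openqa.ini (the section name is an assumption based on openQA's scheduler settings):
# limit how many jobs may run concurrently during the migration
[scheduler]
max_running_jobs = 140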
EDIT: I also ran systemctl edit openqa-scheduler on old-osd and added
[Service]
…
Environment="MAX_JOB_ALLOCATION=160"
We added a "migration announcement" notice on the index page of both old-osd and new-osd.
I stopped and masked salt-minion and salt-master on old-osd with
sudo systemctl mask --now salt-minion salt-master
and disabled https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules , adding corresponding steps to "Suggestions" in the ticket description.
I set up a recurring sync of space-slow with
while date -Is && sleep 1200; do time rsync -aHP --delete --one-file-system openqa.suse.de:/space-slow/ /space-slow/; done
and accordingly for srv, results and assets. For results, as "images/" and "testresults/" are very large, I split this up into
while date -Is && sleep 1200; do time rsync -aHP --delete --one-file-system openqa.suse.de:/results/ /results/ --exclude=images/ --exclude=testresults/ && for i in images testresults; do time rsync -aHP --delete --one-file-system openqa.suse.de:/results/$i/ /results/$i/; done; done
Tomorrow morning we should turn old-osd read-only, sync over /srv, enable the /srv bind mount on new-osd, start the database and carefully check the database+webUI. Then trigger the sync of assets+results+space-slow. As soon as everything is in sync, switch over DNS and ensure the webUI is up, then ensure worker connections. Alternatively, enable full services again with incomplete assets+results and sync over assets+results without --delete while the webUI is already running from the new location.
- @nicksinger prepare the apache config changes to only allow GET requests
- @nicksinger prepare a merge request for DNS switch-over change in OPS-Service
Updated by nicksinger over 1 year ago
Adding the following to our apache2 config (/etc/apache2/vhosts.d/openqa-common.inc) should switch old OSD into RO mode:
<Location />
<LimitExcept GET>
Require all denied
</LimitExcept>
</Location>
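Presumably the change only takes effect after reloading Apache on old OSD, e.g.:
systemctl reload apache2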
Updated by nicksinger over 1 year ago
prepared DNS entry for new OSD: https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3926
Updated by mkittler over 1 year ago
Updated by tinita over 1 year ago
- Description updated (diff)
Regarding rabbitmq: I pinged Lazaros https://suse.slack.com/archives/C04MDKHQE20/p1693304683467939?thread_ts=1693230882.289569&cid=C04MDKHQE20
Updated by okurz over 1 year ago
We realized that we have a static IP entry on salt-controlled hosts; removing those with
for i in $(ssh osd-new "sudo salt-key -L" | grep -v ':$'); do ssh -o StrictHostKeyChecking=no $i "sudo sed -i -e '/2620:113:80c0:8080:10:160:0:207/d' -e 's/10.160.0.207/10.145.10.207/' /etc/hosts"; done
By the way, the announcement message on the index page should be
🚨 We're currently migrating this service - Please also see the announcement for more details. This is https://openqa.suse.de in the old location. This server is now in read-only mode
and respectively for the new location as well.
Updated by tinita over 1 year ago
tinita wrote in #note-35:
Regarding rabbitmq
Publishing works again after Lazaros made some firewall adjustments
Updated by nicksinger over 1 year ago
The salt-states deployment pipeline could not access the new OSD via ssh. I wrote Lazaros in Slack who helped to unblock this; pipelines can now access the new OSD.
We now also face an issue similar to the one described in https://progress.opensuse.org/issues/134522 which requires https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/587 and adjustments to the nue.suse.com zone which do not seem possible in the ops-salt repo - I asked in Slack how we can do this: https://suse.slack.com/archives/C04MDKHQE20/p1693310485167729
Updated by mkittler over 1 year ago
Looks like the static IP entry is now fixed on all hosts. I'm restarting salt-minion via
for i in $( ssh openqa.suse.de 'sudo salt-key --list=accepted | tail -n +2' ) ; do echo "host $i" && ssh -4 -o StrictHostKeyChecking=no "$i" 'sudo systemctl restart salt-minion' ; done
because otherwise it apparently doesn't use the updated entry.
Updated by mkittler over 1 year ago
MR to fix a broken Grafana panel and alert after the migration: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/950
Updated by okurz over 1 year ago
https://openqa.suse.de/tests/11930657 says "Reason: api failure: 400 response: mkdir /var/lib/openqa/testresults/11930/11930657-sle-15-SP4-EC2-BYOS-Updates-x86_64-Build20230828-1-publiccloud_containers@64bit: Permission denied at /usr/lib/perl5/vendor_perl/5.26.1/Mojo/File.pm line 84. ". This looks like an issue on OSD itself. Running
chown -R geekotest.nogroup 119*
on OSD now. Unclear what caused this. Running
host=openqa.suse.de openqa-advanced-retrigger-jobs
to retrigger incompletes.
Announcement sent over email and Slack about completion of the core part of the migration:
We are happy to report that https://openqa.suse.de with its alias https://openqa.nue.suse.com has successfully been migrated from the NUE1 to the PRG2 datacenter. The instance is eagerly executing test jobs from its new location. We will continue to monitor the system even more closely than usual in the next days. Some background tasks and cleanup are still being conducted. If you find any potentially related issues please report them as usual. Enjoy :)
Updated by okurz over 1 year ago
- Description updated (diff)
https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/950 merged. Description updated with resolved tasks.
Updated by okurz over 1 year ago
The results sync finished. I will not start it again and trust that it's complete now. I also checked the permissions in /var/lib/openqa/testresults multiple times and did not encounter the former "root.root" ownership problem again.
Updated by okurz over 1 year ago
- Related to action #134816: [tools] grafana dashboard for `OpenQA Jobs test` partially without any data from OSD migration size:M added
Updated by okurz over 1 year ago
The reverse DNS entry seems to be invalid; this should be fixed by https://gitlab.suse.de/OPS-Service/salt/-/merge_requests/3935/
Updated by okurz over 1 year ago
- Copied to action #134837: SLE test repo not updated on OSD, cron service was not running since 2023-08-29, fetchneedles not called size:M added
Updated by okurz over 1 year ago
Many workers were stuck because their openQA worker processes were hanging while trying to read the old stale NFS share from old-OSD. E.g.
# systemctl status openqa-worker-auto-restart@16
● openqa-worker-auto-restart@16.service - openQA Worker #16
Loaded: loaded (/usr/lib/systemd/system/openqa-worker-auto-restart@.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/openqa-worker-auto-restart@.service.d
└─20-nvme-autoformat.conf, 30-openqa-max-inactive-caching-downloads.conf
Active: active (running) since Wed 2023-08-30 11:20:37 CEST; 3h 31min ago
Process: 51156 ExecStartPre=/usr/bin/install -d -m 0755 -o _openqa-worker /var/lib/openqa/pool/16 (code=exited, status=0/SUCCESS)
Main PID: 51157 (worker)
Tasks: 1 (limit: 12287)
CGroup: /openqa.slice/openqa-worker.slice/openqa-worker-auto-restart@16.service
└─ 51157 /usr/bin/perl /usr/share/openqa/script/worker --instance 16
Aug 30 11:20:38 openqaworker14 worker[51157]: [info] [pid:51157] worker 16:
Aug 30 11:20:38 openqaworker14 worker[51157]: - config file: /etc/openqa/workers.ini
Aug 30 11:20:38 openqaworker14 worker[51157]: - name used to register: openqaworker14
Aug 30 11:20:38 openqaworker14 worker[51157]: - worker address (WORKER_HOSTNAME): openqaworker14.qa.suse.cz
Aug 30 11:20:38 openqaworker14 worker[51157]: - isotovideo version: 40
Aug 30 11:20:38 openqaworker14 worker[51157]: - websocket API version: 1
Aug 30 11:20:38 openqaworker14 worker[51157]: - web UI hosts: openqa.suse.de
Aug 30 11:20:38 openqaworker14 worker[51157]: - class: qemu_x86_64,qemu_x86_64_staging,qemu_x86_64-large-mem,windows11,wsl2,platform_intel,prg,prg_office,openqaworker14
Aug 30 11:20:38 openqaworker14 worker[51157]: - no cleanup: no
Aug 30 11:20:38 openqaworker14 worker[51157]: - pool directory: /var/lib/openqa/pool/16
Same on other machines; the reason is a stuck process, see:
$ ssh openqaworker17.qa.suse.cz
Have a lot of fun...
okurz@openqaworker17:~> ps auxf | grep '\<D\>'
root 2775 0.0 0.0 36596 8264 ? Ss Aug27 0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
okurz 64738 0.0 0.0 8208 772 pts/0 S+ 14:55 0:00 \_ grep --color=auto \<D\>
root 34646 0.0 0.0 5520 728 ? D 10:59 0:00 \_ df
the "df" call is by the openQA worker cache service I think.
That is due to the NFS share on /var/lib/openqa/share.
Now we called
salt \* cmd.run 'ps -eo pid,stat | grep " D\>" && reboot'
to reboot all machines with stuck processes. Created #134846 as improvement.
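A possible alternative to rebooting (a hedged sketch, not what was done here) would be a forced lazy unmount of the stale share, although processes already stuck in D state may still require a reboot:
# detach the stale NFS mount; -f forces, -l detaches lazily once it is no longer busy
umount -f -l /var/lib/openqa/share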
EDIT: https://openqa.suse.de/tests currently shows 340 concurrently running jobs, limited by the server config. So far so good.
Updated by mkittler over 1 year ago
To actually fix the host alert: https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/955
EDIT: The MR has been merged and now the query in the alert is correct (see https://stats.openqa-monitor.qa.suse.de/alerting/grafana/tm0h5mf4k/view).
Updated by mkittler over 1 year ago
To fix the remaining alerts we have to either remove the IPv6 AAAA records in the DNS config, fix IPv6 or only use IPv4 explicitly in our monitoring. Possibly we can also apply the same workaround as we already have on o3 (#133403#note-5).
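For illustration, quick checks for which records exist and whether IPv4 works explicitly (generic commands, not taken from the ticket):
# which A/AAAA records does OSD currently publish?
dig +short A openqa.suse.de
dig +short AAAA openqa.suse.de
# force an IPv4-only probe from the monitoring host
ping -4 -c 1 openqa.suse.de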
Updated by okurz over 1 year ago
- Related to action #134846: Old NFS share mount is keeping processes stuck and openQA workers seem up but do not work on jobs added
Updated by mkittler over 1 year ago
- Description updated (diff)
The backup of the new OSD VM on the backup VM works, e.g. /home/rsnapshot/alpha.0/openqa.suse.de/etc/fstab is current.
I've also just disabled the root login on OSD.
That was the only item from o3's list that applies here and hasn't already been done or mentioned anyway.
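A spot-check of the backup could look like this (hypothetical; the backup host name is a placeholder, run from OSD):
# compare the backed-up fstab with the live one
ssh "$BACKUP_HOST" cat /home/rsnapshot/alpha.0/openqa.suse.de/etc/fstab | diff - /etc/fstab && echo "backup is current"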
Updated by mkittler over 1 year ago
Looks like many workers are not connected. I suppose the following can be ignored:
- worker11.oqa.suse.de: not reachable via SSH, no output visible via SOL, not accepted via salt anyways
- worker12.oqa.suse.de: reachable but no slots online, not accepted via salt anyways
- openqaworker6: moved to o3, see #129484
- openqaworker-arm-1: moved to NUE2, not fully back yet
- grenache-1: deracked for now
This leaves:
- openqaworker9: just slots 12 and 15
- worker40: shown as offline despite services running
- broken slots on sapworker1, worker2 and a few of the worker3x ones
Updated by mkittler over 1 year ago
The openqaworker9 slots are actually just leftovers which I've removed.
The broken slots are just a displaying error. When waiting a short while, they show up as idle and then some other set of slots shows up as broken instead. I think this affects slots that have just finished a job. I guess this is something for another ticket (the graceful disconnect feature likely needs to be reworked).
The issue with worker40 was caused by a hanging NFS mount, and our previous rebooting didn't take this worker into account. So I've just rebooted it now. The worker processes don't hang anymore but they are still unable to connect:
Aug 30 18:03:03 worker40 worker[18479]: [warn] [pid:18479] Failed to register at openqa.suse.de - connection error: Connection refused - trying again in 10 seconds
So only worker40 is still problematic. I'll look into it tomorrow.
Updated by okurz over 1 year ago
- Related to action #134879: reverse DNS resolution PTR for openqa.oqa.prg2.suse.org. yields "3(NXDOMAIN)" for PRG1 workers (NUE1+PRG2 are fine) size:M added
Updated by okurz over 1 year ago
- Copied to action #134888: Ensure no job results are present in the file system for jobs that are no longer in the database - on OSD only added
Updated by okurz over 1 year ago
- Description updated (diff)
mkittler wrote in #note-59:
So only worker40 is still problematic. I'll look into it tomorrow.
That is better done in #132137 though. I split out AC5 into action #134888: Ensure no job results are present in the file system for jobs that are no longer in the database - on OSD only . With that this leaves https://gitlab.suse.de/openqa/osd-deployment/-/pipeline_schedules/36/edit which I enabled now and triggered as https://gitlab.suse.de/openqa/osd-deployment/-/pipelines/784218
Updated by okurz over 1 year ago
- Related to action #134900: salt states fail to apply due to "Pillar openqa.oqa.prg2.suse.org.key does not exist" added
Updated by mkittler over 1 year ago
Yes, since only worker40 was remaining (which is one of the new Prague-located workers) it could have been done as part of #132137. However, I was of course also checking all other workers. I was able to fix the connectivity issues with worker40 now anyway. It was not in salt because I only added it after we set up the new VM and the change hadn't been carried over. That also means it still had the old static IP entry in /etc/hosts, which prevented it from even showing up as an unaccepted host in salt-key (so it wasn't noticed).
By the way, I'm not sure whether the workers showing as broken with "graceful disconnect" are really just a displaying error. I think the problem is that these workers have registered themselves via the API but are still waiting for the websocket server¹ to respond. So they're basically still in the process of establishing a full connection. That state is not handled well by our displaying code but is also problematic on its own as it means that the websocket server is overloaded. I suppose I should create a ticket for that.
¹ e.g. worker slots remain for a very long time at
Aug 31 12:11:55 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:11:56 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:16:56 worker40 worker[122368]: [warn] [pid:122368] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3108, code 502 - trying again in 10 seconds
Aug 31 12:17:06 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:17:10 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
without the "Registered and connected via websockets …" line showing up yet.
Maybe I should create a separate ticket for that.
EDIT: It looks like this isn't even working at all or at least some timeout has been hit:
Aug 31 12:11:55 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:11:56 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:16:56 worker40 worker[122368]: [warn] [pid:122368] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3108, code 502 - trying again in 10 seconds
Aug 31 12:17:06 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:17:10 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:22:10 worker40 worker[122368]: [warn] [pid:122368] Unable to upgrade to ws connection via http://openqa.suse.de/api/v1/ws/3108, code 502 - trying again in 10 seconds
Aug 31 12:22:20 worker40 worker[122368]: [info] [pid:122368] Registering with openQA openqa.suse.de
Aug 31 12:27:09 worker40 worker[122368]: [info] [pid:122368] Establishing ws connection via ws://openqa.suse.de/api/v1/ws/3108
Aug 31 12:27:09 worker40 worker[122368]: [info] [pid:122368] Registered and connected via websockets with openQA host openqa.suse.de and worker ID 3108
Considering the timestamps are exactly 5 minutes apart (most likely the gateway timeout for the websocket connection) and it worked on the next attempt it is likely just the websocket server being severely overloaded.
EDIT: I've created a ticket for that #134924.
Updated by okurz over 1 year ago
- Due date deleted (2023-09-08)
- Status changed from In Progress to Resolved
Then we are good