action #54902 (closed)
openQA on osd fails at "incomplete" status when uploading, "502 response: Proxy Error"
Description
See https://openqa.suse.de/tests/3187414 for an example: the tests run through OK, but the job stays in the "uploading" status for a long time, then changes to "incomplete" and gets restarted automatically.
https://openqa.suse.de/tests/3187414/file/autoinst-log.txt shows:
[2019-07-31T05:05:29.0311 CEST] [error] [pid:39796] Failed uploading chunk
[2019-07-31T05:05:29.0311 CEST] [error] [pid:39796] Error uploading SLED-12-SP5-0206-x86_64_for_regression.qcow2: 502 response: Proxy Error
[2019-07-31T05:05:29.0311 CEST] [error] [pid:39796] Upload failed for chunk 1262
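The 502 "Proxy Error" itself is generated by the Apache reverse proxy in front of the openQA web UI when the backend does not answer in time, so the chunk POST from the worker fails at the proxy. A minimal shell sketch of that retry-with-backoff pattern, purely illustrative and not the actual worker code (the API route, the chunk file and the missing authentication are hypothetical placeholders):
# Illustrative only: retry one chunk upload through the reverse proxy with backoff.
chunk=chunk_1262.bin                                       # hypothetical chunk file
url=https://openqa.suse.de/api/v1/jobs/3187414/artefact    # assumed route, may differ
for attempt in 1 2 3 4 5; do
    status=$(curl -s -o /dev/null -w '%{http_code}' -X POST -F "file=@${chunk}" "$url")
    [ "$status" = 200 ] && break
    echo "attempt $attempt failed with HTTP $status (e.g. 502 from the proxy), retrying" >&2
    sleep $((attempt * 10))
done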
Updated by coolo about 5 years ago
- Category set to Regressions/Crashes
- Priority changed from Normal to High
- Target version set to Current Sprint
This looks severe - no idea where this is coming from
Updated by livdywan about 5 years ago
I'm seeing at least 3 variants between the clones:
[2019-07-31T04:34:32.0860 CEST] [info] [pid:39796] Uploading logs_from_installation_system-y2logs.tar.bz2
[2019-07-31T04:34:33.0564 CEST] [info] [pid:39796] Uploading SLED-12-SP5-0206-x86_64_for_regression.qcow2
[2019-07-31T05:05:29.0311 CEST] [error] [pid:39796] Failed uploading chunk
[2019-07-31T05:05:29.0311 CEST] [error] [pid:39796] Error uploading SLED-12-SP5-0206-x86_64_for_regression.qcow2: 502 response: Proxy Error
[2019-07-31T05:05:29.0311 CEST] [error] [pid:39796] Upload failed for chunk 1262
[2019-07-31T05:12:03.0469 CEST] [info] [pid:39796] Uploading video.ogv
SLE-12-SP5-Desktop-DVD-x86_64-Build0206-Media1.iso is available, but there are no logs or other assets
[2019-07-31T06:06:28.691 CEST] [debug] killing command server 17377 because test execution ended
[2019-07-31T06:06:28.691 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20123/leIhI7s9iMQRNMaB/broadcast
[2019-07-31T06:06:43.702 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /usr/bin/isotovideo line 172.
[2019-07-31T06:06:44.704 CEST] [error] can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 91.
[2019-07-31T06:06:44.711 CEST] [debug] commands process exited: 0
[2019-07-31T06:06:45.712 CEST] [debug] done with command server
[2019-07-31T06:06:45.712 CEST] [debug] isotovideo done
Updated by zcjia about 5 years ago
I canceled this job yesterday and restarted it today, and it passed: https://openqa.suse.de/tests/3198655
I wonder if it's related to the recent openQA deployment or the OSD site asset cleanup.
Updated by mkittler about 5 years ago
- Category deleted (Regressions/Crashes)
- Priority changed from High to Normal
- Target version deleted (Current Sprint)
Moving assets certainly used a lot of bandwidth and may be the cause of some of the connection problems. However, not all of these failures happened at the time the assets were moved.
The deployment could also have caused some of the failures since it required restarting the web UI and the websocket server. Especially restarting the websocket server has not been handled well by certain worker versions. However, this should be fixed by now.
In the last 500 jobs there are only a few incompletes left, and those failed for multiple/different reasons.
Updated by mkittler about 5 years ago
- Category set to Regressions/Crashes
- Target version set to Current Sprint
Updated by JRivrain about 5 years ago
I have reported a pretty similar problem here: https://progress.opensuse.org/issues/54869, but in that case we do not always have the "502 response" error in the logs.
It is blocking several of our tests because it happens almost every time on aarch64.
Updated by coolo about 5 years ago
- Priority changed from Normal to Urgent
Marius happened to delete the priority - now raising it even higher.
The issue already bothered sebchlad over the weekend, so it's completely unrelated to the asset moving.
Updated by okurz about 5 years ago
- Related to action #54869: improve feedback in case job is incompleted due to too long uploading (was: Test fails as incomplete most of the time, no clue what happens from the logs.) added
Updated by okurz about 5 years ago
- Status changed from New to In Progress
- Assignee set to okurz
#54869 has more information as well. As we saw over the past days, the upload speeds of various jobs are simply abysmal. Checking the load on osd I can see that it has a general load of around 12, which is pretty high but not dead-locked :) Wasn't there some problem with the database lately that coolo "fixed" by doing … something?
On osd in htop sorting by "DISK R/W" I can see:
PID USER PRI IO NI OOM VIRT RES SHR S DISK R/W CPU% MEM% TIME+ Command
8453 postgres 20 B4 0 9 220M 154M 145M D 8.29 M/s 29.2 1.0 4:37.06 postgres: geekotest openqa [local] SELECT
24819 geekotest 20 B4 0 12 362M 202M 8872 S 4.23 M/s 13.3 1.3 5:45.43 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
1647 geekotest 20 B4 0 11 345M 185M 8856 S 2.75 M/s 16.5 1.2 1:46.28 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
18322 geekotest 20 B4 0 12 359M 196M 8872 R 2.61 M/s 13.3 1.2 3:09.33 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
8360 geekotest 20 B4 0 12 353M 193M 8864 S 2.31 M/s 15.2 1.2 1:03.06 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
1353 geekotest 20 B4 0 11 353M 193M 8848 S 2.13 M/s 9.5 1.2 1:43.96 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
18958 geekotest 20 B4 0 11 347M 183M 4796 S 2.12 M/s 10.2 1.2 0:19.66 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
16021 geekotest 20 B4 0 11 348M 188M 8880 S 2.10 M/s 14.0 1.2 3:24.29 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
11390 geekotest 20 B4 0 11 351M 191M 8856 S 2.08 M/s 23.5 1.2 0:45.34 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
32123 geekotest 20 B4 0 12 357M 197M 8880 S 2.00 M/s 8.2 1.2 1:53.28 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
17605 geekotest 20 B4 0 11 354M 190M 4804 S 1.69 M/s 19.7 1.2 0:24.26 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
25508 postgres 20 B4 0 9 218M 151M 145M S 1.52 M/s 11.4 1.0 0:10.37 postgres: geekotest openqa [local] UPDATE
19614 geekotest 20 B4 0 10 335M 175M 8840 S 1.33 M/s 12.1 1.1 0:18.88 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
28166 wwwrun 20 B4 0 0 113M 6540 4236 S 1.23 M/s 0.6 0.0 0:07.45 /usr/sbin/httpd-prefork -DSYSCONFIG -DSSL -DSTATUS -DHTTPS -C PidFile /var/run/httpd.pid -C Include /etc/apache2/sysconfig.d//loadmodule.conf -C Include /etc/apache2/sysconfig.d//global.conf -f /etc/apache2/httpd.conf -c Include /etc/
28486 geekotest 20 B4 0 11 352M 192M 8856 S 1.18 M/s 7.6 1.2 2:12.21 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
1138 geekotest 20 B4 0 13 374M 213M 9484 S 916.30 K/s 8.9 1.3 5:07.39 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
20079 geekotest 20 B4 0 11 343M 183M 8848 S 896.16 K/s 7.6 1.2 0:14.97 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
9812 geekotest 20 B4 0 11 339M 179M 8872 S 893.64 K/s 11.4 1.1 3:38.16 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
11582 geekotest 20 B4 0 12 361M 200M 8864 S 838.26 K/s 11.4 1.3 3:45.12 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
26380 geekotest 20 B4 0 12 359M 199M 8856 S 810.57 K/s 8.9 1.3 2:21.32 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
16663 geekotest 20 B4 0 12 364M 205M 8880 S 634.36 K/s 8.2 1.3 3:21.05 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
2720 geekotest 20 B4 0 12 354M 194M 8864 S 624.29 K/s 10.2 1.2 4:32.71 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
21473 geekotest 20 B4 0 13 372M 212M 8872 S 621.77 K/s 9.5 1.3 2:42.86 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
22875 geekotest 20 B4 0 0 0 0 0 Z 614.22 K/s 5.1 0.0 0:00.08 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
31263 geekotest 20 B4 0 11 352M 192M 8880 S 468.22 K/s 10.8 1.2 1:59.89 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
25579 geekotest 20 B4 0 11 348M 187M 8880 S 463.18 K/s 23.5 1.2 2:20.43 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
640 geekotest 20 B4 0 11 351M 187M 8856 S 430.46 K/s 7.6 1.2 1:43.48 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
8687 postgres 20 B4 0 8 210M 133M 133M S 302.08 K/s 0.6 0.8 3h59:30 postgres: writer process
25610 postgres 20 B4 0 9 218M 152M 145M S 166.14 K/s 1.9 1.0 0:08.17 postgres: geekotest openqa [local] idle
12926 geekotest 20 B4 0 11 344M 180M 4804 S 98.17 K/s 17.8 1.1 0:38.47 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
12415 postgres 20 B4 0 9 216M 149M 145M S 60.42 K/s 1.9 0.9 0:03.72 postgres: geekotest openqa [local] idle
12395 geekotest 20 B4 0 10 331M 168M 4804 S 57.90 K/s 25.4 1.1 0:41.37 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
which looks like there are quite a few connections to workers transferring at good bandwidth, probably uploading. The "postgres: geekotest openqa [local] SELECT" process could be something fishy, but that process also just disappeared afterwards, so it is not that bad.
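As a side note, the same per-process disk I/O view can also be captured non-interactively, which makes it easier to correlate with the monitoring data afterwards; a small sketch, assuming the sysstat package is installed (the log path is just an example):
# Sample per-process disk I/O and per-device utilisation, 12 samples of 5s each,
# and append everything to a log file for later correlation.
{
    echo "=== $(date -Is) ==="
    pidstat -d 5 12   # per-process read/write KB/s
    iostat -x 5 12    # per-device utilisation and await times
} >> /tmp/osd-io-$(date +%F).log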
Updated by okurz about 5 years ago
By now, Saturday morning, the load has decreased. OSD has a mean load of below 4.0. Also, currently there are no jobs in "uploading". I am reviewing new test failures but so far see no severe openQA or infrastructure related problems. I can see that all jobs that had problems in uploading, i.e. ones that went incomplete in the end, were restarted and eventually succeeded, so I am not even sure if we have any regression here or just even more scheduled tests causing an overload. Especially s390x-kvm-sle12 seems to lag behind and still has some scheduled jobs.
I have the feeling that the problem we originally wanted to fix with "caching" is not really fixed but has moved to the "uploading" side. (IMHO – as I originally raised for caching – this is a problem that openQA should not even have tried to solve; the infrastructure should. If it were up to me, I would have neither caching nor chunked uploading in openQA.)
Ideas I have for improvement:
- Stress tests for openQA, e.g. start 20 workers which all want to upload in parallel
- Serialize the "create_hdd" scenarios to prevent too many parallel jobs
- Limit the number of concurrent upload jobs on the websockets server side
- Actually reduce the number of possible apache workers to prevent I/O overload (see the sketch after this list)
- Queue up jobs for uploading by letting them wait before "Uploading" until there are enough resources available again on the webUI host
- Allow jobs to linger longer in "Uploading", e.g. increase the timeout within the websocket connections (if this makes sense)
- Improve feedback for jobs in "Uploading"
- prevent the repeating auto-reloads of the webpage
- provide the current status from logs over the webUI
- Think about more efficient use of assets:
- Why would a worker host need to upload images to another machine when afterwards the image is only used on the same worker host?
- Reuse caching for uploading as well?
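To make the apache-related ideas above a bit more concrete, here is a rough sketch of limiting the prefork workers and raising the proxy timeout; purely illustrative, the file path and all values are assumptions and not the current osd configuration:
# Hypothetical Apache drop-in for the reverse proxy in front of openQA
# (path and numbers are assumptions, adjust to the real vhost configuration).
cat > /etc/apache2/conf.d/openqa-tuning.conf <<'EOF'
# Fewer prefork workers so fewer uploads hit the disk at the same time
<IfModule mpm_prefork_module>
    MaxRequestWorkers 100
</IfModule>
# Give slow backend responses (large chunk uploads) more time before a 502
Timeout 300
ProxyTimeout 300
EOF
systemctl reload apache2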
Updated by okurz about 5 years ago
- Status changed from In Progress to Feedback
There is nothing I am currently doing except monitoring.
Updated by okurz about 5 years ago
Discussed with mkittler and he created #55112 in response. Nothing more is planned in the short term, so we could even reduce the prio again, but I think keeping it in "Feedback" at "Urgent" is also OK for now because we all need to be aware of this problem even if we are not doing anything except waiting.
Updated by kraih about 5 years ago
I've looked into the chunked upload code a bit, and it does look fairly inefficient. Whether that is a reason for this issue I'm not sure. But since nobody dares to touch the code, it will need to be cleaned up at some point anyway. The worker would also benefit a lot from making the uploader non-blocking, allowing status updates to the WebSocket server at all times. This would be a new epic ticket though, and okurz made it clear that he would like to collect more data before we do anything, so I'll hold off for now.
Updated by okurz about 5 years ago
Discussed the topic during the team meetup on 2019-08-06; we elaborated on the point "limit concurrent upload jobs on the server side" as well as a necessary rework of the uploading code vs. downloading.
Updated by okurz about 5 years ago
- Copied to action #55154: osd reporting monitoring alerts about "CRITICAL - Socket timeout after 16 seconds Service: HTTP access openqa" and "CRITICAL - I/O stats tps=2108.80 KB_read/s=322876.80 KB_written/s=100.80 iowait=0.88 Service: iostat openqa" added
Updated by okurz about 5 years ago
- Copied to action #55238: jobs with high amount of log files, thumbnails, test results are incompleted but the job continues with upload attempts added
Updated by okurz about 5 years ago
- Copied to action #55313: long running (minutes) postgres SELECT calls on osd added
Updated by okurz about 5 years ago
Still there is very high load on osd. Trying more severe measures to decrease it now: Stopping the service "telegraf" as that daemon seems to have quite some CPU usage and we are currently closely monitoring the instance manually anyway.
Updated by okurz about 5 years ago
After the somewhat lazier weekend osd is back to high load in the 20s, sometimes 30s. As "collectd" is now reporting the error "write_graphite plugin: Connecting to openqa-monitoring.qa.suse.de:2003 via tcp failed. The last error was: failed to connect to remote host: No route to host" and it is very visible in the CPU time, I am also stopping that service for now:
systemctl stop collectd
IMHO both telegraf and collectd should run with a higher nice level (CPU+I/O). What could also be done is to run rsyncd with a higher nice level, and the same for vsftpd, as the services directly on the instance should have priority over the non-interactive remote access.
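A sketch of how that could look, persistently, via a systemd drop-in rather than a one-off renice; the service name and the values are examples only, nothing here has been applied:
# Example: lower CPU and I/O priority of telegraf via a systemd drop-in
# (the same pattern would apply to collectd, rsyncd, vsftpd, ...).
mkdir -p /etc/systemd/system/telegraf.service.d
cat > /etc/systemd/system/telegraf.service.d/nice.conf <<'EOF'
[Service]
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=7
EOF
systemctl daemon-reload
systemctl restart telegraf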
Updated by riafarov about 5 years ago
In the latest build of SLE 12 the job did not become incomplete, but the image was not properly published: https://openqa.suse.de/tests/3245595/file/autoinst-log.txt
This looks worse than before, as everything appears to be fine and the problem only becomes visible on child jobs, which should not be triggered at all.
[2019-08-14T01:52:10.0383 UTC] [error] [pid:18763] Failed uploading chunk
[2019-08-14T01:52:10.0384 UTC] [error] [pid:18763] Error uploading sle-12-SP5-aarch64-Build0265-Server-DVD@aarch64-gnome-encrypted.qcow2: 502 response: Proxy Error
[2019-08-14T01:52:10.0384 UTC] [error] [pid:18763] Upload failed for chunk 837
Updated by okurz about 5 years ago
I am trying to reduce the overload by stopping workers, e.g. I stopped all worker instances on openqaworker9 now:
okurz@openqa:~> sudo salt -C 'G@roles:worker' cmd.run 'systemctl status openqa-worker@\* | grep -c "Active: active"'
…
openqaworker9.suse.de:
24
okurz@openqa:~> sudo salt -C 'openqaworker9*' cmd.run 'systemctl stop openqa-worker.target'
openqaworker9.suse.de:
okurz@openqa:~> sudo salt -C 'G@roles:worker' cmd.run 'systemctl status openqa-worker@\* | grep -c "Active: active"'
…
openqaworker9.suse.de:
0
Also I temporarily disabled the build view on the index page with an override template:
openqa:/etc/openqa/templates # cat main/index.html.ep
% layout 'bootstrap';
% title '';
%= include 'layouts/info'
% content_for 'ready_function' => begin
setupIndexPage();
hideNavbar();
alignBuildLabels();
% end
<div class="jumbotron">
<div class='container'>
<div class="row">
<div class="col-md-9">
%= include_branding 'docbox'
</div>
<div class="col-md-3 hidden-sm-down">
%= include_branding 'sponsorbox'
</div>
</div>
</div>
</div>
<div id="build-results">
Temporarily disabled due to performance reasons. See <a href="https://progress.opensuse.org/issues/54902">https://progress.opensuse.org/issues/54902</a> for details
</div>
The page still does not load very fast, probably the lookup of job groups is also not that efficient.
I nice'd rsyncd, salt-master, obslisten, smbd, vsftpd, cupsd (?), salt-minion
Updated by okurz about 5 years ago
Also stopped openqaworker6 now. The problem is that the workers get re-enabled. Need to change the salt config: https://gitlab.suse.de/openqa/salt-pillars-openqa/commit/8f933532b2abca2c324dbb4d3f438464afda6edd but I haven't restarted workers or anything yet as there are so many jobs running.
EDIT (after the night): The change seems to be effective now, fewer worker instances are running
Also stopped many instances on openqaworker13 with https://gitlab.suse.de/openqa/salt-pillars-openqa/commit/d4fb47ae2d7ba74a5aa55ae50a9b919bc2dc933a as I have the feeling this helps.
I created an infra ticket to ask if the number of CPUs can be increased.
Updated by okurz about 5 years ago
https://openqa.suse.de/tests/3256416 is a recent example:
openqa:/home/okurz # grep 3256416 /var/log/openqa
[2019-08-16T06:15:17.0953 CEST] [warn] dead job 3256416 aborted and duplicated 3257259
[2019-08-16T06:15:23.0943 CEST] [info] Got status update for job 3256416 with unexpected worker ID 1123 (expected no updates anymore, job is done with result incomplete)
…
The cloned job https://openqa.suse.de/tests/3257259 is fine, but the overall upload process took 1.5 h, as can be seen in https://openqa.suse.de/tests/3257259/file/worker-log.txt
Updated by okurz about 5 years ago
Seems that at least one person (asmorodskyi) was confused by the message I put on the index page, so I changed the text to "Build results on the index page are temporarily disabled due to performance reasons, Use /tests or the job group pages as usual" instead.
Updated by okurz about 5 years ago
- Related to action #55628: [qam][incomplete] test fails in bind with "zypper -n si bind' failed with code 105" but logs look fine? added
Updated by okurz about 5 years ago
As often, it could be that the problems we observe are older than the ticket reporting time. I suspect the "worker rework" from Martchus as the biggest trigger, e.g. also see #55628 about problems installing packages from osd repos from 21 days ago. But – as always – it could also be that we had underlying problems, e.g. limited performance of osd, that are just made more obvious or more severe by recent code changes.
Updated by okurz about 5 years ago
As multiple persons were confused by the index page replacement we have disabled it again for now. Apparently some have not read the text :(
EDIT: Cross-checking with @mkittler we found out that I did not use it to its full extent: the template simply did not render results, which helps a little bit by executing less Perl code, but the bulk of the time saving comes from skipping the complete query in the backend code.
Updated by okurz about 5 years ago
- Copied to action #55691: ask Engineering Infrastructure to change monitoring ping check target from / to /tests to reduce performance impact added
Updated by okurz about 5 years ago
coolo has pointed out that the obs sync plugin was using very inefficient, blocking calls to external tools, which can clog the Mojolicious workers completely until the requests are completed and can lead to the severe timeouts we have seen. Good that the still open PR https://github.com/os-autoinst/openQA/pull/2225 did not really make that obvious :(
Updated by coolo about 5 years ago
This wasn't the culprit though. I pimped up our monitoring, but it didn't help.
The disk just stalls without warning - Rudi and I can see the same on the underlying VM host and it turns out the link to the netapp generates a lot of Tx errors. Let's hope this can be fixed.
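In this case the errors were only visible on the storage link between the netapp and the switch, but for reference the equivalent check on a Linux host would be watching the interface error counters with standard tools; a small sketch (the interface name is a placeholder):
# Show RX/TX error and drop counters for the suspected interface
ip -s link show dev eth0
# More detailed NIC statistics (CRC, carrier, ... errors), if the driver supports it
ethtool -S eth0 | grep -iE 'err|drop'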
Updated by coolo about 5 years ago
- Status changed from Feedback to Resolved
Quoting Rudi:
turned out to be NIC errors on the connection
netapp01 port e3a <-> nx5672up-srv1-d-4a port 2/11
the stalls were caused by the error transmissions and after retransmit timeout
the packet was sent again and things went on.
cleaned both ends of the FC cable, no new errors since about 30 mins now
Updated by okurz about 5 years ago
- Status changed from Resolved to In Progress
Great that we found a deeper solution that helps. If you don't mind, I will keep the ticket open for now until all the things we reverted or temporarily switched off have been brought back again, e.g. the reduced worker count.
Updated by okurz about 5 years ago
- Status changed from In Progress to Feedback
bringing back all workers: https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/187
Ideas I would like to discuss first before closing:
- Improvement ideas as mentioned in #54902#note-10
- running telegraf, collectd, rsyncd, salt-master, obslisten, smbd, vsftpd, cupsd (?), salt-minion with higher nice level (#54902#note-19 and #54902#note-21)
Updated by coolo about 5 years ago
collectd and cups are gone, and I'm not sure what obslisten is.
vsftpd, smbd and rsyncd all have in common that they are generally very low on CPU usage, so a nice level won't help you. The I/O they're doing is going to doom you if your disk bandwidth is low - and a nice level won't help you there either, as virtual disks use the deadline I/O scheduler.
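Which scheduler is actually active can be checked per block device; a quick sketch (the device name is a placeholder):
# The scheduler shown in brackets is the active one for this device
cat /sys/block/vda/queue/scheduler
# example output: [mq-deadline] kyber bfq none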
Leaves telegraf - it's using quite a lot of CPU as it's parsing quite some log files all the time. But in case something goes wrong, it's our only hope to find out what's going wrong. It's not hogging the CPU whatsoever as it runs in intervals - and if needed we can enlarge the interval, but increasing its nice level would only result in 'we know something was strange at 14:00 as the load increased but after that we can't say what happened as we have no data'. See https://en.wikipedia.org/wiki/Schrödinger%27s_cat for details
Updated by okurz about 5 years ago
coolo wrote:
[…] not sure what obslisten is.
the obs rsync plugin or daemon (?).
Leaves the suggestions from #54902#note-10
Updated by mkittler about 5 years ago
- Target version changed from Current Sprint to Done
Updated by mkittler about 5 years ago
- Target version changed from Done to Current Sprint
Updated by JERiveraMoya about 5 years ago
- Copied to action #56447: openqaworker-arm-2 is out-of-space on / (was: openQA on osd fails with empty logs) added
Updated by JERiveraMoya about 5 years ago
- Copied to deleted (action #56447: openqaworker-arm-2 is out-of-space on / (was: openQA on osd fails with empty logs))
Updated by okurz almost 5 years ago
This is an autogenerated message for openQA integration by the openqa_review script:
This bug is still referenced in a failing openQA test: minimal+base+sdk+proxy_SCC-postreg
https://openqa.suse.de/tests/3388694
Updated by okurz about 2 years ago
- Related to coordination #112718: [alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend added