action #54902 (closed)

openQA on osd fails at "incomplete" status when uploading, "502 response: Proxy Error"

Added by zcjia over 4 years ago. Updated over 4 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: Regressions/Crashes
Target version:
Start date: 2019-07-31
Due date:
% Done: 0%
Estimated time:

Description

See https://openqa.suse.de/tests/3187414 for an example: the test runs through OK, but it stays in the "uploading" status for a long time, then changes to "incomplete" and gets restarted automatically.

https://openqa.suse.de/tests/3187414/file/autoinst-log.txt shows:

[2019-07-31T05:05:29.0311 CEST] [error] [pid:39796] Failed uploading chunk
[2019-07-31T05:05:29.0311 CEST] [error] [pid:39796] Error uploading SLED-12-SP5-0206-x86_64_for_regression.qcow2: 502 response: Proxy Error
[2019-07-31T05:05:29.0311 CEST] [error] [pid:39796] Upload failed for chunk 1262


Related issues 7 (1 open, 6 closed)

Related to openQA Project - action #54869: improve feedback in case job is incompleted due to too long uploading (was: Test fails as incomplete most of the time, no clue what happens from the logs.) (Resolved, okurz, 2019-07-30)

Related to openQA Tests - action #55628: [qam][incomplete] test fails in bind with "zypper -n si bind' failed with code 105" but logs look fine? (Resolved, 2019-08-16)

Related to openQA Infrastructure - coordination #112718: [alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend (Resolved, okurz, 2022-06-22)

Copied to openQA Infrastructure - action #55154: osd reporting monitoring alerts about "CRITICAL - Socket timeout after 16 seconds Service: HTTP access openqa" and "CRITICAL - I/O stats tps=2108.80 KB_read/s=322876.80 KB_written/s=100.80 iowait=0.88 Service: iostat openqa" (Resolved, okurz, 2019-07-31)

Copied to openQA Project - action #55238: jobs with high amount of log files, thumbnails, test results are incompleted but the job continues with upload attempts (New, 2019-07-31)

Copied to openQA Project - action #55313: long running (minutes) postgres SELECT calls on osd (Rejected, okurz, 2019-07-31)

Copied to openQA Infrastructure - action #55691: ask Engineering Infrastructure to change monitoring ping check target from / to /tests to reduce performance impact (Resolved, okurz, 2019-07-31)

Actions #1

Updated by coolo over 4 years ago

  • Category set to Regressions/Crashes
  • Priority changed from Normal to High
  • Target version set to Current Sprint

This looks severe - no idea where this is coming from

Actions #2

Updated by livdywan over 4 years ago

I'm seeing at least 3 variants between the clones:

3187414

[2019-07-31T04:34:32.0860 CEST] [info] [pid:39796] Uploading logs_from_installation_system-y2logs.tar.bz2
[2019-07-31T04:34:33.0564 CEST] [info] [pid:39796] Uploading SLED-12-SP5-0206-x86_64_for_regression.qcow2
[2019-07-31T05:05:29.0311 CEST] [error] [pid:39796] Failed uploading chunk
[2019-07-31T05:05:29.0311 CEST] [error] [pid:39796] Error uploading SLED-12-SP5-0206-x86_64_for_regression.qcow2:     502 response: Proxy Error
[2019-07-31T05:05:29.0311 CEST] [error] [pid:39796] Upload failed for chunk 1262
[2019-07-31T05:12:03.0469 CEST] [info] [pid:39796] Uploading video.ogv

3189506 3191057

SLE-12-SP5-Desktop-DVD-x86_64-Build0206-Media1.iso available but no log or other assets

3189714

[2019-07-31T06:06:28.691 CEST] [debug] killing command server 17377 because test execution ended
[2019-07-31T06:06:28.691 CEST] [debug] isotovideo: informing websocket clients before stopping command server: http://127.0.0.1:20123/leIhI7s9iMQRNMaB/broadcast
[2019-07-31T06:06:43.702 CEST] [debug] isotovideo: unable to inform websocket clients about stopping command server: Request timeout at /usr/bin/isotovideo line 172.
[2019-07-31T06:06:44.704 CEST] [error] can_read received kill signal at /usr/lib/os-autoinst/myjsonrpc.pm line 91.
[2019-07-31T06:06:44.711 CEST] [debug] commands process exited: 0
[2019-07-31T06:06:45.712 CEST] [debug] done with command server
[2019-07-31T06:06:45.712 CEST] [debug] isotovideo done
Actions #3

Updated by zcjia over 4 years ago

I canceled this job yesterday and restarted it today, and it passed: https://openqa.suse.de/tests/3198655

I wonder if it's related to the recent openQA deployment or the OSD site asset cleanup.

Actions #4

Updated by mkittler over 4 years ago

  • Category deleted (Regressions/Crashes)
  • Priority changed from High to Normal
  • Target version deleted (Current Sprint)

Moving assets certainly used a lot of bandwidth and may be the cause of some of the connection problems. However, not all of these failures happened at the time the assets were moved.

The deployment could also have caused some of the failures since it required restarting the web UI and the websocket server. Especially restarting the websocket server has not been handled well by certain worker versions. However, this should be fixed by now.

Among the last 500 jobs there are only a few incompletes left, and those failed for multiple/different reasons.

Actions #5

Updated by mkittler over 4 years ago

  • Category set to Regressions/Crashes
  • Target version set to Current Sprint
Actions #6

Updated by JRivrain over 4 years ago

I have reported a pretty similar problem here: https://progress.opensuse.org/issues/54869, but in that case we do not always have the "502 response" error in the logs.
It is blocking several of our tests because it happens almost every time on aarch64.

Actions #7

Updated by coolo over 4 years ago

  • Priority changed from Normal to Urgent

Marius happened to lower the priority - now raising it even higher.

The issue already bothered sebchlad over the weekend, so it's completely unrelated to the asset moving.

Actions #8

Updated by okurz over 4 years ago

  • Related to action #54869: improve feedback in case job is incompleted due to too long uploading (was: Test fails as incomplete most of the time, no clue what happens from the logs.) added
Actions #9

Updated by okurz over 4 years ago

  • Status changed from New to In Progress
  • Assignee set to okurz

#54869 has more information as well. As we saw over the past days, the upload speeds of various jobs are simply abysmal. Checking the load on osd I can see that it has a general load of around 12, which is pretty high but not dead-locked :) Wasn't there some problem with the database lately that coolo "fixed" by doing … something?

On osd, sorting htop by "DISK R/W", I can see:

  PID USER      PRI IO  NI   OOM  VIRT   RES   SHR S    DISK R/W CPU% MEM%   TIME+  Command                                                                                                                                                                                                                                   
 8453 postgres   20 B4   0     9  220M  154M  145M D    8.29 M/s 29.2  1.0  4:37.06 postgres: geekotest openqa [local] SELECT                                                                                                                                                                                                 
24819 geekotest  20 B4   0    12  362M  202M  8872 S    4.23 M/s 13.3  1.3  5:45.43 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
 1647 geekotest  20 B4   0    11  345M  185M  8856 S    2.75 M/s 16.5  1.2  1:46.28 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
18322 geekotest  20 B4   0    12  359M  196M  8872 R    2.61 M/s 13.3  1.2  3:09.33 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
 8360 geekotest  20 B4   0    12  353M  193M  8864 S    2.31 M/s 15.2  1.2  1:03.06 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
 1353 geekotest  20 B4   0    11  353M  193M  8848 S    2.13 M/s  9.5  1.2  1:43.96 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
18958 geekotest  20 B4   0    11  347M  183M  4796 S    2.12 M/s 10.2  1.2  0:19.66 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
16021 geekotest  20 B4   0    11  348M  188M  8880 S    2.10 M/s 14.0  1.2  3:24.29 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
11390 geekotest  20 B4   0    11  351M  191M  8856 S    2.08 M/s 23.5  1.2  0:45.34 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
32123 geekotest  20 B4   0    12  357M  197M  8880 S    2.00 M/s  8.2  1.2  1:53.28 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
17605 geekotest  20 B4   0    11  354M  190M  4804 S    1.69 M/s 19.7  1.2  0:24.26 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
25508 postgres   20 B4   0     9  218M  151M  145M S    1.52 M/s 11.4  1.0  0:10.37 postgres: geekotest openqa [local] UPDATE
19614 geekotest  20 B4   0    10  335M  175M  8840 S    1.33 M/s 12.1  1.1  0:18.88 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
28166 wwwrun     20 B4   0     0  113M  6540  4236 S    1.23 M/s  0.6  0.0  0:07.45 /usr/sbin/httpd-prefork -DSYSCONFIG -DSSL -DSTATUS -DHTTPS -C PidFile /var/run/httpd.pid -C Include /etc/apache2/sysconfig.d//loadmodule.conf -C Include /etc/apache2/sysconfig.d//global.conf -f /etc/apache2/httpd.conf -c Include /etc/
28486 geekotest  20 B4   0    11  352M  192M  8856 S    1.18 M/s  7.6  1.2  2:12.21 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
 1138 geekotest  20 B4   0    13  374M  213M  9484 S  916.30 K/s  8.9  1.3  5:07.39 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
20079 geekotest  20 B4   0    11  343M  183M  8848 S  896.16 K/s  7.6  1.2  0:14.97 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
 9812 geekotest  20 B4   0    11  339M  179M  8872 S  893.64 K/s 11.4  1.1  3:38.16 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
11582 geekotest  20 B4   0    12  361M  200M  8864 S  838.26 K/s 11.4  1.3  3:45.12 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
26380 geekotest  20 B4   0    12  359M  199M  8856 S  810.57 K/s  8.9  1.3  2:21.32 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
16663 geekotest  20 B4   0    12  364M  205M  8880 S  634.36 K/s  8.2  1.3  3:21.05 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
 2720 geekotest  20 B4   0    12  354M  194M  8864 S  624.29 K/s 10.2  1.2  4:32.71 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
21473 geekotest  20 B4   0    13  372M  212M  8872 S  621.77 K/s  9.5  1.3  2:42.86 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
22875 geekotest  20 B4   0     0     0     0     0 Z  614.22 K/s  5.1  0.0  0:00.08 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
31263 geekotest  20 B4   0    11  352M  192M  8880 S  468.22 K/s 10.8  1.2  1:59.89 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
25579 geekotest  20 B4   0    11  348M  187M  8880 S  463.18 K/s 23.5  1.2  2:20.43 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
  640 geekotest  20 B4   0    11  351M  187M  8856 S  430.46 K/s  7.6  1.2  1:43.48 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
 8687 postgres   20 B4   0     8  210M  133M  133M S  302.08 K/s  0.6  0.8  3h59:30 postgres: writer process
25610 postgres   20 B4   0     9  218M  152M  145M S  166.14 K/s  1.9  1.0  0:08.17 postgres: geekotest openqa [local] idle
12926 geekotest  20 B4   0    11  344M  180M  4804 S   98.17 K/s 17.8  1.1  0:38.47 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800
12415 postgres   20 B4   0     9  216M  149M  145M S   60.42 K/s  1.9  0.9  0:03.72 postgres: geekotest openqa [local] idle
12395 geekotest  20 B4   0    10  331M  168M  4804 S   57.90 K/s 25.4  1.1  0:41.37 /usr/bin/perl /usr/share/openqa/script/openqa prefork -m production --proxy -i 100 -H 400 -w 30 -G 800

This looks like quite a few connections from workers transferring at good bandwidth, probably uploads. The postgres: geekotest openqa [local] SELECT could be something fishy, but that process just disappeared again afterwards, so it is probably not that bad.
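
A quick way to check whether such SELECTs really run for minutes would be to query pg_stat_activity directly; this is only a sketch, assuming local access on osd as the postgres user:

sudo -u postgres psql openqa -c "SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query FROM pg_stat_activity WHERE state <> 'idle' ORDER BY runtime DESC LIMIT 10;"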

Actions #10

Updated by okurz over 4 years ago

By now, Saturday morning, the load has decreased: OSD has a mean load below 4.0, and currently there are no jobs in "uploading". I am reviewing new test failures, but so far there are no severe openQA or infrastructure related problems. I can see that all jobs that had problems in uploading, i.e. the ones that went incomplete in the end, were restarted and eventually succeeded, so I am not even sure if we have any regression here or just even more scheduled tests causing an overload. Especially s390x-kvm-sle12 seems to lag behind and still has some scheduled jobs.

I have the feeling that the problem we originally wanted to fix with "caching" is not really fixed but has moved to the "uploading" side. (IMHO, as I already argued when caching was introduced, this is a problem that openQA should not have tried to solve in the first place; the infrastructure should. If it were up to me there would be neither caching nor chunked uploading in openQA.)

Ideas I have for improvement:

  • Stress tests for openQA, e.g. start 20 workers which all want to upload in parallel
  • Serialize the "create_hdd" scenarios to prevent too many parallel jobs
  • Limit the number of concurrent upload jobs on the websocket server side
    • Actually reduce the number of possible Apache workers to prevent I/O overload (see the sketch after this list)
    • Queue up jobs for uploading by letting them wait before "Uploading" until enough resources are available again on the webUI host
  • Allow jobs to linger longer in "Uploading", e.g. increase the timeout within the websocket connections (if this makes sense)
  • Improve feedback for jobs in "Uploading"
    • prevent the repeating auto-reloads of the webpage
    • provide the current status from logs over the webUI
  • Think about more efficient use of assets:
    • Why would a worker host need to upload images to another machine when afterwards the image is only used on the same worker host?
    • Reuse caching for uploading as well?
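
As an illustration of the "reduce the number of possible Apache workers" item above, here is a minimal sketch, assuming the SUSE Apache layout where /etc/apache2/conf.d/*.conf is included; the file name and the values are made up and would need tuning against the actual I/O capacity:

cat > /etc/apache2/conf.d/limit-prefork.conf <<'EOF'
# cap the prefork MPM so fewer parallel uploads hit the proxy and the disk at once
<IfModule mpm_prefork_module>
    MaxRequestWorkers      60
    MaxConnectionsPerChild 2000
</IfModule>
EOF
systemctl reload apache2
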
Actions #11

Updated by okurz over 4 years ago

  • Status changed from In Progress to Feedback

There is nothing I am currently doing except monitoring.

Actions #12

Updated by okurz over 4 years ago

Discussed with mkittler and he created #55112 in response. Nothing more is planned short-term, so we could even reduce the prio again, but I think keeping it in "Feedback" at "Urgent" is also ok for now because we all need to stay aware of this problem even if we are not doing anything except waiting.

Actions #13

Updated by kraih over 4 years ago

I've looked into the chunked upload code a bit, and it does look fairly inefficient. Whether that is a reason for this issue I'm not sure. But since nobody dares to touch the code, it will need to be cleaned up at some point anyway. The worker would also benefit a lot from making the uploader non-blocking, allowing status updates to the WebSocket server at all times. This would be a new epic ticket though, and okurz made it clear that he would like to collect more data before we do anything, so I'll hold off for now.

Actions #14

Updated by okurz over 4 years ago

Discussed the topic during the team meetup on 2019-08-06; we elaborated on the point "limit concurrent upload jobs on the server side" as well as a necessary rework of the uploading code vs. the downloading code.

Actions #15

Updated by okurz over 4 years ago

  • Copied to action #55154: osd reporting monitoring alerts about "CRITICAL - Socket timeout after 16 seconds Service: HTTP access openqa" and "CRITICAL - I/O stats tps=2108.80 KB_read/s=322876.80 KB_written/s=100.80 iowait=0.88 Service: iostat openqa" added
Actions #16

Updated by okurz over 4 years ago

  • Copied to action #55238: jobs with high amount of log files, thumbnails, test results are incompleted but the job continues with upload attempts added
Actions #17

Updated by okurz over 4 years ago

  • Copied to action #55313: long running (minutes) postgres SELECT calls on osd added
Actions #18

Updated by okurz over 4 years ago

There is still very high load on osd. Trying more severe measures to decrease it now: stopping the service "telegraf", as that daemon has quite some CPU usage and we are currently closely monitoring the instance manually anyway.

Actions #19

Updated by okurz over 4 years ago

After the somewhat quieter weekend, osd is back to high load in the 20s, sometimes 30s. As "collectd" is now reporting the error "write_graphite plugin: Connecting to openqa-monitoring.qa.suse.de:2003 via tcp failed. The last error was: failed to connect to remote host: No route to host" and is clearly visible in the CPU time, I am also stopping that service for now:

systemctl stop collectd

IMHO both telegraf and collectd should run with a higher nice level (CPU+I/O). The same could be done for rsyncd and vsftpd, as the services running directly on the instance should have priority over non-interactive remote access.
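
A minimal sketch of how that could be done persistently via a systemd drop-in, assuming the service is named rsyncd.service; note that the I/O priority setting only takes effect with the CFQ/BFQ I/O schedulers:

mkdir -p /etc/systemd/system/rsyncd.service.d
cat > /etc/systemd/system/rsyncd.service.d/nice.conf <<'EOF'
[Service]
# lower CPU priority and use low-priority best-effort I/O
Nice=10
IOSchedulingClass=best-effort
IOSchedulingPriority=7
EOF
systemctl daemon-reload && systemctl restart rsyncd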

Actions #20

Updated by riafarov over 4 years ago

In the latest build of SLE 12 the job didn't become incomplete, but the image was not properly published: https://openqa.suse.de/tests/3245595/file/autoinst-log.txt
This looks worse than before because everything appears to be fine and the problem becomes visible only in the child jobs, which should not have been triggered at all.

[2019-08-14T01:52:10.0383 UTC] [error] [pid:18763] Failed uploading chunk
[2019-08-14T01:52:10.0384 UTC] [error] [pid:18763] Error uploading sle-12-SP5-aarch64-Build0265-Server-DVD@aarch64-gnome-encrypted.qcow2: 502 response: Proxy Error
[2019-08-14T01:52:10.0384 UTC] [error] [pid:18763] Upload failed for chunk 837
Actions #21

Updated by okurz over 4 years ago

I am trying to reduce the overload by stopping workers, e.g. I stopped all worker instances on openqaworker9 now:

okurz@openqa:~> sudo salt -C 'G@roles:worker' cmd.run 'systemctl status openqa-worker@\* | grep -c "Active: active"'
…
openqaworker9.suse.de:
    24
okurz@openqa:~> sudo salt -C 'openqaworker9*' cmd.run 'systemctl stop openqa-worker.target'
openqaworker9.suse.de:
okurz@openqa:~> sudo salt -C 'G@roles:worker' cmd.run 'systemctl status openqa-worker@\* | grep -c "Active: active"'
…
openqaworker9.suse.de:
    0

Also I temporarily disabled the build view on the index page with an override template:

openqa:/etc/openqa/templates # cat main/index.html.ep
% layout 'bootstrap';
% title '';

%= include 'layouts/info'

% content_for 'ready_function' => begin
    setupIndexPage();
    hideNavbar();
    alignBuildLabels();
% end

<div class="jumbotron">
  <div class='container'>
    <div class="row">
      <div class="col-md-9">
        %= include_branding 'docbox'
      </div>
      <div class="col-md-3 hidden-sm-down">
        %= include_branding 'sponsorbox'
      </div>
    </div>
  </div>
</div>

<div id="build-results">
Temporarily disabled due to performance reasons. See <a href="https://progress.opensuse.org/issues/54902">https://progress.opensuse.org/issues/54902</a> for details
</div>

The page still does not load very fast; probably the lookup of job groups is not that efficient either.

I nice'd rsyncd, salt-master, obslisten, smbd, vsftpd, cupsd (?) and salt-minion.
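
Presumably this was done ad hoc on the already running processes, roughly like the following sketch for one of them (the process name matching is an assumption and may need adjustment per service):

pgrep -x vsftpd | xargs -r renice -n 10 -p
pgrep -x vsftpd | xargs -r ionice -c2 -n7 -p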

Actions #22

Updated by okurz over 4 years ago

Also stopped openqaworker6 now. The problem is that the workers get re-enabled, so the salt config needs to be changed: https://gitlab.suse.de/openqa/salt-pillars-openqa/commit/8f933532b2abca2c324dbb4d3f438464afda6edd - but I haven't restarted the workers or anything yet, as there are so many jobs running.

EDIT (after the night): The change seems to be effective now, fewer worker instances are running.

Also stopped many instances on openqaworker13 with https://gitlab.suse.de/openqa/salt-pillars-openqa/commit/d4fb47ae2d7ba74a5aa55ae50a9b919bc2dc933a as I have the feeling this helps.

I created an infra ticket to ask if the number of CPUs can be increased.

Actions #23

Updated by okurz over 4 years ago

https://openqa.suse.de/tests/3256416 is a recent example:

openqa:/home/okurz # grep 3256416 /var/log/openqa
[2019-08-16T06:15:17.0953 CEST] [warn] dead job 3256416 aborted and duplicated 3257259
[2019-08-16T06:15:23.0943 CEST] [info] Got status update for job 3256416 with unexpected worker ID 1123 (expected no updates anymore, job is done with result incomplete)
…

The clone https://openqa.suse.de/tests/3257259 is fine, but the overall upload process took 1.5 h, as can be seen in https://openqa.suse.de/tests/3257259/file/worker-log.txt

Actions #24

Updated by okurz over 4 years ago

Seems that at least one person (asmorodskyi) was confused by the message I put on the index page, so I changed the text to "Build results on the index page are temporarily disabled due to performance reasons, Use /tests or the job group pages as usual" instead.

Actions #25

Updated by okurz over 4 years ago

  • Related to action #55628: [qam][incomplete] test fails in bind with "zypper -n si bind' failed with code 105" but logs look fine? added
Actions #26

Updated by okurz over 4 years ago

As often, it could be that the problems we observe are older than the ticket's reporting time. I suspect the "worker rework" from Martchus as the biggest trigger, e.g. also see #55628 about problems installing packages from osd repos from 21 days ago. But, as always, it could also be that we had underlying problems, e.g. limited performance of osd, that are just made more obvious or more severe by recent code changes.

Actions #27

Updated by okurz over 4 years ago

As multiple people were confused by the index page replacement, we have disabled it again for now. Apparently some did not read the text :(

EDIT: Cross-checking with @mkittler we found out that I did not use this to its full extent: the override template simply did not render the results, which helps a little because less Perl code is executed, but the big time saving comes from skipping the complete query in the backend code.

Actions #28

Updated by okurz over 4 years ago

  • Copied to action #55691: ask Engineering Infrastructure to change monitoring ping check target from / to /tests to reduce performance impact added
Actions #29

Updated by okurz over 4 years ago

coolo has pointed out that the obs sync plugin was using very inefficient, blocking calls to external tools, which can clog the Mojolicious prefork workers completely until the requests are completed and which can lead to the severe timeouts we have seen. Good that the still-open PR https://github.com/os-autoinst/openQA/pull/2225 did not really make that obvious :(

Actions #30

Updated by coolo over 4 years ago

This wasn't the culprit though. I pimped up our monitoring, but it didn't help.

The disk just stalls without warning - Rudi and I can see the same on the underlying VM host and it turns out the link to the netapp generates a lot of Tx errors. Let's hope this can be fixed.
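
For reference, such Tx errors would show up in the interface statistics on a Linux host; a sketch, assuming the relevant NIC is eth0 (the actual check here was done on the VM host and the NetApp/switch side):

ip -s link show dev eth0         # check the TX "errors" counter
ethtool -S eth0 | grep -i err    # driver-level error counters, if supported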

Actions #31

Updated by coolo over 4 years ago

  • Status changed from Feedback to Resolved

Quoting Rudi:

turned out to be NIC errors on the connection
netapp01 port e3a <-> nx5672up-srv1-d-4a port 2/11

the stalls were caused by the error transmissions and after retransmit timeout
the packet was sent again and things went on.

cleaned both ends of the FC cable, no new errors since about 30 mins now

Actions #32

Updated by okurz over 4 years ago

  • Status changed from Resolved to In Progress

Great that we found a deeper solution that helps. If you don't mind, I will keep the ticket open for now until everything we reverted or temporarily switched off has been brought back again, e.g. the reduced worker count.

Actions #33

Updated by okurz over 4 years ago

  • Status changed from In Progress to Feedback

Bringing back all workers: https://gitlab.suse.de/openqa/salt-pillars-openqa/merge_requests/187

Ideas I would like to discuss first before closing:

Actions #34

Updated by coolo over 4 years ago

collectd and cups are gone, and I am not sure what obslisten is.

vsftpd, smbd and rsyncd all have in common that they are generally very low on CPU usage, so a nice level won't help you. The I/O they are doing is going to doom you if your disk bandwidth is low - and a nice level won't help you there either, as virtual disks use the deadline I/O scheduler.

That leaves telegraf. It uses quite a lot of CPU as it parses quite a few log files all the time, but in case something goes wrong it is our only hope to find out what is going wrong. It is not hogging the CPU permanently as it runs in intervals, and if needed we can enlarge the interval. Increasing its nice level would only result in "we know something was strange at 14:00 as the load increased, but after that we can't say what happened as we have no data". See https://en.wikipedia.org/wiki/Schrödinger%27s_cat for details.
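
The scheduler claim can be verified per block device; a sketch, assuming the virtual disk shows up as vda:

cat /sys/block/vda/queue/scheduler    # the active scheduler is shown in brackets, e.g. [deadline]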

Actions #35

Updated by okurz over 4 years ago

coolo wrote:

[…] not sure what obslisten is.

The obs rsync plugin or daemon (?).

That leaves the suggestions from #54902#note-10.

Actions #36

Updated by mkittler over 4 years ago

  • Target version changed from Current Sprint to Done
Actions #37

Updated by mkittler over 4 years ago

  • Target version changed from Done to Current Sprint
Actions #38

Updated by JERiveraMoya over 4 years ago

  • Copied to action #56447: openqaworker-arm-2 is out-of-space on / (was: openQA on osd fails with empty logs) added
Actions #39

Updated by JERiveraMoya over 4 years ago

  • Copied to deleted (action #56447: openqaworker-arm-2 is out-of-space on / (was: openQA on osd fails with empty logs))
Actions #40

Updated by okurz over 4 years ago

  • Status changed from Feedback to Resolved

No response to the open points, so I have recorded them in the new tickets #56594 and #56591.

Actions #41

Updated by okurz over 4 years ago

This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: minimal+base+sdk+proxy_SCC-postreg
https://openqa.suse.de/tests/3388694

Actions #42

Updated by okurz almost 2 years ago

  • Related to coordination #112718: [alert][osd] openqa.suse.de is not reachable anymore, response times > 30s, multiple alerts over the weekend added