action #135407

closed

coordination #110833: [saga][epic] Scale up: openQA can handle a schedule of 100k jobs with 1k worker instances

coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert

[tools] Measure to mitigate websockets overload by workers and revert it size:M

Added by osukup 8 months ago. Updated 7 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
Feature requests
Target version:
Start date:
2023-09-08
Due date:
% Done:

0%

Estimated time:

Description

Motivation

Consolidate all steps we took to mitigate #135122 and document how to revert them.

1) Stopped workers:

used:
sudo salt 'worker3[1,2,3,4,5,6]*' cmd.run 'sudo systemctl disable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d "." -f 1 | xargs)'\
&& for i in {1..6}; do sudo salt-key -y -d "worker3$i*"; done

revert:
for i in {1..6}; do sudo salt-key -y -a "worker3$i*";done && sudo salt 'worker3[1,2,3,4,5,6]*' state.apply
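
The revert above accepts all six keys and applies the state in one go. A more gradual variant, bringing machines back one host at a time, could look like the sketch below. The names `run`, `readd_worker` and the `DRY_RUN` switch are illustrative and not part of any existing tooling; with `DRY_RUN=1` (the default here) the commands are only printed, without their original quoting.

```shell
#!/bin/sh
# Sketch: re-accept the salt key of ONE worker host and apply the state.
# run, readd_worker and DRY_RUN are hypothetical names for illustration.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, execute it otherwise.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

readd_worker() {
    host="$1"
    # Same two steps as the revert above, limited to a single host.
    run sudo salt-key -y -a "${host}*" &&
        run sudo salt "${host}*" state.apply
}

# Bring back worker31 first, check its jobs, then continue with the next host.
readd_worker worker31
```

Setting `DRY_RUN=0` would execute the commands for real, so that only makes sense on the salt master.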

2) Lowered the number of workers

used:
https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/606

revert:
revert mentioned MR in GitLab

Acceptance criteria

  • AC1: Ensure step #1 has been reverted
  • AC2: DONE Ensure step #2 has been reverted

Suggestions

  • Maybe don't bring them all back at once (and be prepared to remove them again in case of new performance issues)
  • In case of new performance issues make sure to strace the openqa-scheduler and openqa-websockets processes
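
For the strace suggestion, a minimal sketch of how the two processes could be traced is shown below. It assumes that the systemd units are named like the processes (`openqa-scheduler`, `openqa-websockets`), and the output path under `/tmp` and the helper name `strace_cmd` are made up for illustration. The script only prints the strace commands so they can be reviewed before running them.

```shell
#!/bin/sh
# Sketch: build an strace invocation for a given service name and PID.
# strace_cmd and the /tmp output path are hypothetical.
strace_cmd() {
    name="$1"
    pid="$2"
    # -f follows child processes, -tt adds timestamps, -o writes to a file.
    echo "sudo strace -f -tt -p ${pid} -o /tmp/${name}.strace"
}

for svc in openqa-scheduler openqa-websockets; do
    # Assumption: the systemd unit has the same name as the process.
    pid="$(systemctl show -p MainPID --value "$svc" 2>/dev/null || true)"
    # MainPID is 0 (or empty) if the unit is not running on this host.
    if [ -n "$pid" ] && [ "$pid" != "0" ]; then
        strace_cmd "$svc" "$pid"
    else
        echo "# $svc: no running main PID found"
    fi
done
```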

Related issues 2 (0 open, 2 closed)

Related to openQA Project - action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M (Resolved, dheidler)

Copied to openQA Infrastructure - action #137756: Re-enable worker31 for multi-machine tests in production auto_review:"tcpdump.+check.log.+timed out at" (Resolved, okurz)

Actions #1

Updated by osukup 8 months ago

  • Related to coordination #135122: [epic] OSD openQA refuses to assign jobs, >3k scheduled not being picked up, no alert added
Actions #2

Updated by tinita 8 months ago

  • Target version set to Ready
Actions #3

Updated by okurz 8 months ago

  • Parent task set to #135122
Actions #4

Updated by okurz 8 months ago

  • Category set to Feature requests
  • Priority changed from Normal to High
Actions #5

Updated by livdywan 8 months ago

  • Subject changed from [tools] Measure to mitigate websockets overload by workers and revert it to [tools] Measure to mitigate websockets overload by workers and revert it size:M
  • Description updated (diff)
  • Status changed from New to Workable
Actions #6

Updated by livdywan 7 months ago

  • Status changed from Workable to In Progress

Let's start with worker31.oqa.prg2.suse.org:

sudo systemctl enable --now openqa-worker-auto-restart@{1..$(grep numofworkers /etc/openqa/workers.ini | awk {'print $3'})}.service
sudo salt-key -y -a "worker31*" && sudo salt 'worker31*' state.apply
The following keys are going to be accepted:
Unaccepted Keys:
worker31.oqa.prg2.suse.org
Key for minion worker31.oqa.prg2.suse.org accepted.
worker31.oqa.prg2.suse.org:
    Minion did not return. [Not connected]
ERROR: Minions returned with non-zero exit code

sudo salt 'worker31*' state.apply
worker31.oqa.prg2.suse.org:
    Data failed to compile:
----------
    The function "state.apply" is running as PID 64259 and was started at 2023, Sep 20 13:41:35.355380 with jid 20230920134135355380

I'm guessing the state is being applied, but the CLI somehow can't cope with it so I'll check in later.

Edit: Oh wow, it even picked up a job like instantaneously despite salt being in tears.
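
A side note on the `{1..$(...)}` range in the enable command above: whether it expands depends on the shell. bash performs brace expansion before command substitution, so there the braces are left as literal text, while other shells such as zsh handle this differently. A sketch that avoids brace expansion altogether is shown below; the helper name `worker_units` is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the worker unit names for instances 1..N without relying on
# brace expansion. worker_units is a hypothetical helper.
worker_units() {
    n="$1"
    i=1
    units=""
    while [ "$i" -le "$n" ]; do
        # Append the next unit name, separated by a single space.
        units="${units}${units:+ }openqa-worker-auto-restart@${i}.service"
        i=$((i + 1))
    done
    echo "$units"
}

worker_units 3
# On the worker itself this could be combined with the count from workers.ini:
#   n=$(grep numofworkers /etc/openqa/workers.ini | awk '{print $3}')
#   sudo systemctl enable --now $(worker_units "$n")
```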

Actions #7

Updated by okurz 7 months ago

  • Assignee set to livdywan
  • Priority changed from High to Urgent

But please handle failures like https://openqa.suse.de/tests/12205812#step/iscsi_client/53 with urgency as this is again "multi-machine tests failing"

Actions #8

Updated by okurz 7 months ago

  • Priority changed from Urgent to Immediate

How is #135056-20 not surprising ;)

Actions #9

Updated by openqa_review 7 months ago

  • Due date set to 2023-10-05

Setting due date based on mean cycle time of SUSE QE Tools

Actions #10

Updated by okurz 7 months ago

  • Priority changed from Immediate to Urgent

Mitigation

in worker31:

sudo salt 'worker31*' cmd.run 'systemctl mask --now $(systemctl list-units | grep openqa-worker-auto-restart | cut -d "." -f 1 | xargs)'

and to handle the affected failed jobs

for state in failed parallel_failed; do env WORKER=worker31 host=openqa.suse.de result="result='$state'" t_finished=2023-09-20 bash -ex openqa-advanced-retrigger-jobs; done

We also looked up all jobs manually to label them:

for i in $(ssh osd "sudo -u geekotest psql --no-align --tuples-only --command=\"select id from jobs where assigned_worker_id in (select id from workers where (host='worker31')) and t_finished >= '2023-09-20';\" openqa"); do openqa-cli api --osd -X post jobs/$i/comments text="poo135407"; done

covering 1943 jobs.
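
To reuse the lookup for another host or date, the query can be parameterized. The sketch below only builds the SQL string used in the loop above; `jobs_query` is a hypothetical helper and it does no escaping, so it should only be fed trusted values.

```shell
#!/bin/sh
# Sketch: build the job-id query for one worker host and a start date.
# jobs_query is a hypothetical helper; arguments are inserted verbatim.
jobs_query() {
    host="$1"
    since="$2"
    printf "select id from jobs where assigned_worker_id in (select id from workers where (host='%s')) and t_finished >= '%s';\n" "$host" "$since"
}

jobs_query worker31 2023-09-20
# The result could then be passed to psql as in the comment above, e.g.
#   sudo -u geekotest psql --no-align --tuples-only --command="$(jobs_query worker31 2023-09-20)" openqa
```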

Actions #11

Updated by livdywan 7 months ago

https://github.com/os-autoinst/scripts/pull/261 so next time we don't have to reverse engineer what the script already does to also add comments with the ticket

Actions #12

Updated by okurz 7 months ago

openQA worker instances on worker31 were again running. I don't know what unmasked them.

Following https://gitlab.suse.de/openqa/salt-states-openqa#remarks-about-the-systemd-units-used-to-start-workers I called

sudo systemctl mask --now openqa-reload-worker-auto-restart@{1..40}.{service,path}

and also ran the openqa-advanced-retrigger-jobs invocation again.

I removed worker31 from salt now. Better disable the production worker classes first before bringing the machines back.

Actions #13

Updated by livdywan 7 months ago

  • Priority changed from Urgent to High

worker31 has been offline for 23 hours so I assume this is no longer urgent.

I removed worker31 from salt now. Better disable the production worker classes first before bringing the machines back.

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/622

Actions #14

Updated by okurz 7 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/622 merged, please monitor the deployment and any potential impact. Then slowly bring back the workers. According to https://openqa.suse.de/admin/workers we currently have 773 worker instances online on OSD. Enabling all 40 instances on each of w31-36 would add 240 instances for a total of 1013 worker instances, which is above the (artificial) limit of 1k that we defined but would be a good and still low-risk test for the fixes of #135122
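
The instance count in this comment can be rechecked with plain shell arithmetic. In the sketch below `projected_instances` is a hypothetical helper, and the figures (773 online instances, 6 hosts, 40 instances per host, limit of 1000) are the ones quoted above.

```shell
#!/bin/sh
# Sketch: project the total number of worker instances after re-enabling hosts.
# projected_instances is a hypothetical helper.
projected_instances() {
    current="$1"
    hosts="$2"
    per_host="$3"
    echo $((current + hosts * per_host))
}

limit=1000
total="$(projected_instances 773 6 40)"
echo "projected: $total (limit: $limit)"
if [ "$total" -gt "$limit" ]; then
    echo "over the limit by $((total - limit))"
fi
```

With the numbers from the comment this reports a projected total of 1013, i.e. 13 instances above the limit.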

Actions #15

Updated by okurz 7 months ago

I have added worker31 to salt and unmasked a single worker instance no. 1 for now

EDIT: https://openqa.suse.de/tests/12274504 passed https://openqa.suse.de/admin/workers/2567 overall looks very stable.

sudo systemctl enable --now openqa-reload-worker-auto-restart@{1..40}.path openqa-reload-worker-auto-restart@{1..40}.service openqa-worker-auto-restart@{1..40}.service

Jobs on other worker instances work well, e.g. https://openqa.suse.de/tests/12274815#

Actions #17

Updated by livdywan 7 months ago

for i in {2..6}; do sudo salt-key -y -a "worker3$i*";done && sudo salt 'worker3[1,2,3,4,5,6]*' state.apply

I hadn't saved the comment. Workers are back in salt. Next step running some jobs on them.

Actions #18

Updated by okurz 7 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/630 merged, please monitor multi-machine production jobs on w31 and enable the other worker machines.

From SQL:

select distinct j.id,test,result,host,instance,j.t_created from jobs j join job_settings js on j.id = js.job_id join workers w on j.assigned_worker_id = w.id where j.t_created >= '2023-10-04' and (result = 'failed' or test = 'incomplete') and arch='x86_64' and host ~ 'worker3[1-9]*' and test !~ ':investigate:' order by j.t_created desc limit 30;

    id    |                  test                  | result |   host   | instance |      t_created      
----------+----------------------------------------+--------+----------+----------+---------------------
 12372953 | hpc_ALPHA_openmpi_mpi_slave00          | failed | worker39 |       20 | 2023-10-04 10:40:12
 12372950 | hpc_BETA_openmpi_mpi_slave01           | failed | worker30 |       41 | 2023-10-04 10:39:52
 12372946 | hpc_GAMMA_slurm_18_08_master_backup_db | failed | worker31 |       28 | 2023-10-04 10:39:13
 12372940 | hpc_ganglia_server                     | failed | worker39 |        4 | 2023-10-04 10:38:23
 12372936 | hpc_GAMMA_slurm_20_11_master_backup_db | failed | worker30 |       35 | 2023-10-04 10:38:14
 12372926 | hpc_pdsh_genders_slave                 | failed | worker30 |       49 | 2023-10-04 10:38:07
 12372865 | qam-incidentinstall                    | failed | worker33 |       33 | 2023-10-04 10:10:22
 12372785 | hpc_GAMMA_slurm_20_11_master_backup_db | failed | worker39 |        5 | 2023-10-04 10:10:13
 12372786 | hpc_ganglia_client                     | failed | worker38 |        9 | 2023-10-04 10:10:13
 12372793 | hpc_pdsh_genders_slave                 | failed | worker37 |       14 | 2023-10-04 10:10:13
 12372794 | hpc_pdsh_master                        | failed | worker30 |       32 | 2023-10-04 10:10:13
 12372746 | hpc_GAMMA_slurm_18_08_db               | failed | worker31 |       41 | 2023-10-04 10:10:12
 12372760 | hpc_GAMMA_slurm_20_11_slave00          | failed | worker30 |       49 | 2023-10-04 10:10:12
 12372765 | hpc_GAMMA_slurm_20_11_slave01          | failed | worker39 |       27 | 2023-10-04 10:10:12
 12372736 | qam-incidentinstall                    | failed | worker33 |       23 | 2023-10-04 10:10:11
 12372717 | minimal+base                           | failed | worker33 |       15 | 2023-10-04 10:07:50
 12372714 | cryptlvm_minimal_x                     | failed | worker35 |       11 | 2023-10-04 10:07:49
 12372687 | sles4sap_gnome_saptune_notes           | failed | worker30 |        4 | 2023-10-04 09:41:15
 12372688 | sles4sap_gnome_saptune_overrides       | failed | worker30 |       17 | 2023-10-04 09:41:15
 12372686 | sles4sap_gnome_saptune_delete_rename   | failed | worker30 |        5 | 2023-10-04 09:41:14
 12372683 | sles4sap_gnome_saptune_notes           | failed | worker30 |       17 | 2023-10-04 09:41:08
 12372682 | sles4sap_gnome_saptune_delete_rename   | failed | worker30 |        4 | 2023-10-04 09:41:07
 12372665 | sles4sap_gnome_saptune_solutions       | failed | worker30 |        5 | 2023-10-04 09:40:40
 12372664 | sles4sap_gnome_saptune_overrides       | failed | worker30 |       17 | 2023-10-04 09:40:39
 12372661 | sles4sap_gnome_saptune_solutions       | failed | worker30 |        4 | 2023-10-04 09:40:36
 12372658 | sles4sap_gnome_saptune_delete_rename   | failed | worker30 |        5 | 2023-10-04 09:40:32
 12372645 | sles4sap_gnome_saptune_solutions       | failed | worker30 |       17 | 2023-10-04 09:40:08
 12372644 | sles4sap_gnome_saptune_overrides       | failed | worker30 |        4 | 2023-10-04 09:40:06
 12372641 | sles4sap_gnome_saptune_solutions       | failed | worker30 |        5 | 2023-10-04 09:40:03
 12372640 | sles4sap_gnome_saptune_overrides       | failed | worker30 |       17 | 2023-10-04 09:40:02

please look into those

Actions #19

Updated by livdywan 7 months ago

please look into those

Not sure why you're including workers outside of 31-36 but I checked them in any case and found no relevant failures.
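
The extra hosts are probably explained by the host pattern in the query: `worker3[1-9]*` is not anchored and `[1-9]*` also matches zero characters, so any host name containing `worker3` matches. The sketch below reproduces this with `grep -E`, which should behave the same as the PostgreSQL `~` operator for a pattern this simple. The stricter pattern `^worker3[1-6]$` is an assumption about the intended set, and `matching_hosts` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: list which of the given host names match a regular expression.
# matching_hosts is a hypothetical helper.
matching_hosts() {
    pattern="$1"
    shift
    for h in "$@"; do
        if printf '%s\n' "$h" | grep -Eq -- "$pattern"; then
            printf '%s\n' "$h"
        fi
    done
}

echo "pattern from the query:"
matching_hosts 'worker3[1-9]*' worker30 worker31 worker36 worker39

echo "stricter pattern (assumed intent, worker31 to worker36 only):"
matching_hosts '^worker3[1-6]$' worker30 worker31 worker36 worker39
```

The first call lists all four hosts, the second only worker31 and worker36.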

Actions #20

Updated by livdywan 7 months ago

$ for i in {001..100} ; do retry -e -- ./script/openqa-clone-job --skip-chained-deps --parental-inheritance --within-instance https://openqa.suse.de 12369569 TEST+=-$i WORKER_CLASS=tap_poo135407 BUILD=poo135407a SKIP_MAINTENANCE_UPDATES=1 _GROUP=0 ; done

I'm assuming any mix of workers is fine as this is about confirming that all remaining workers are on the same baseline. Should anything suggest they're not we can still go back to more targeted jobs.

Of course I again made a mistake and ran w/o _GROUP=0 first and cancelled some jobs... running another batch, let's see how that goes.

Actions #21

Updated by livdywan 7 months ago

Apparently some systemd services were failing, so I restarted them all.

Of course I again made a mistake and ran w/o _GROUP=0 first and cancelled some jobs... running another batch, let's see how that goes.

And spelling the WORKER_CLASS as tap_135407 might help, too...

Actions #22

Updated by livdywan 7 months ago

  • Status changed from In Progress to Feedback

So far jobs are looking good, no production mm jobs yet - in case of a sudden disaster they could still be taken out using this:

sudo salt 'worker3[2,3,4,5,6]*' cmd.run 'sudo systemctl disable --now telegraf $(systemctl list-units | grep openqa-worker-auto-restart | cut -d "." -f 1 | xargs)'\
&& for i in {1..6}; do sudo salt-key -y -d "worker3$i*"; done

In case there's no disaster and depending on my non-prod mm jobs I would strive to enable mm in prod tomorrow.

Actions #23

Updated by okurz 7 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/635 (merged) to enable machines again for production use of multi-machine tests.

Actions #24

Updated by nicksinger 7 months ago

List of failed tests on worker3[0-6]:

https://openqa.suse.de/tests/12370582
https://openqa.suse.de/tests/12370583
https://openqa.suse.de/tests/12370594
https://openqa.suse.de/tests/12370595
https://openqa.suse.de/tests/12370598
https://openqa.suse.de/tests/12370602
https://openqa.suse.de/tests/12370603
https://openqa.suse.de/tests/12370604
https://openqa.suse.de/tests/12370607
https://openqa.suse.de/tests/12370608
https://openqa.suse.de/tests/12370610
https://openqa.suse.de/tests/12370613
https://openqa.suse.de/tests/12370615
https://openqa.suse.de/tests/12370648
https://openqa.suse.de/tests/12370673
https://openqa.suse.de/tests/12370674
https://openqa.suse.de/tests/12370675
https://openqa.suse.de/tests/12370677
https://openqa.suse.de/tests/12370679
https://openqa.suse.de/tests/12370680
https://openqa.suse.de/tests/12370681
https://openqa.suse.de/tests/12370682
https://openqa.suse.de/tests/12370717
https://openqa.suse.de/tests/12370718
https://openqa.suse.de/tests/12370980
https://openqa.suse.de/tests/12371083
https://openqa.suse.de/tests/12371084
https://openqa.suse.de/tests/12371085
https://openqa.suse.de/tests/12371086
https://openqa.suse.de/tests/12371089
https://openqa.suse.de/tests/12371090
https://openqa.suse.de/tests/12371091
https://openqa.suse.de/tests/12371093
https://openqa.suse.de/tests/12371094
https://openqa.suse.de/tests/12371095
https://openqa.suse.de/tests/12371096
https://openqa.suse.de/tests/12371097
https://openqa.suse.de/tests/12371110
https://openqa.suse.de/tests/12371253
https://openqa.suse.de/tests/12371440
https://openqa.suse.de/tests/12371594
https://openqa.suse.de/tests/12371845
https://openqa.suse.de/tests/12371954
https://openqa.suse.de/tests/12372081
https://openqa.suse.de/tests/12372330
https://openqa.suse.de/tests/12372331
https://openqa.suse.de/tests/12372332
https://openqa.suse.de/tests/12372538
https://openqa.suse.de/tests/12372539
https://openqa.suse.de/tests/12372540
https://openqa.suse.de/tests/12372542
https://openqa.suse.de/tests/12372543
https://openqa.suse.de/tests/12372544
https://openqa.suse.de/tests/12372545
https://openqa.suse.de/tests/12372550
https://openqa.suse.de/tests/12372610
https://openqa.suse.de/tests/12372614
https://openqa.suse.de/tests/12372619
https://openqa.suse.de/tests/12372620
https://openqa.suse.de/tests/12372640
https://openqa.suse.de/tests/12372641
https://openqa.suse.de/tests/12372644
https://openqa.suse.de/tests/12372645
https://openqa.suse.de/tests/12372658
https://openqa.suse.de/tests/12372661
https://openqa.suse.de/tests/12372664
https://openqa.suse.de/tests/12372665
https://openqa.suse.de/tests/12372682
https://openqa.suse.de/tests/12372683
https://openqa.suse.de/tests/12372686
https://openqa.suse.de/tests/12372687
https://openqa.suse.de/tests/12372688
https://openqa.suse.de/tests/12372714
https://openqa.suse.de/tests/12372717
https://openqa.suse.de/tests/12372736
https://openqa.suse.de/tests/12372746
https://openqa.suse.de/tests/12372760
https://openqa.suse.de/tests/12372794
https://openqa.suse.de/tests/12372865
https://openqa.suse.de/tests/12372926
https://openqa.suse.de/tests/12372936
https://openqa.suse.de/tests/12372946
https://openqa.suse.de/tests/12372950
https://openqa.suse.de/tests/12373050
https://openqa.suse.de/tests/12373079
https://openqa.suse.de/tests/12373083
https://openqa.suse.de/tests/12373084
https://openqa.suse.de/tests/12373159
https://openqa.suse.de/tests/12373162
https://openqa.suse.de/tests/12373175
https://openqa.suse.de/tests/12373181
https://openqa.suse.de/tests/12373184
https://openqa.suse.de/tests/12373192
https://openqa.suse.de/tests/12373194
https://openqa.suse.de/tests/12373197
https://openqa.suse.de/tests/12373198
https://openqa.suse.de/tests/12373202
https://openqa.suse.de/tests/12373220
https://openqa.suse.de/tests/12373246
https://openqa.suse.de/tests/12373247
https://openqa.suse.de/tests/12373258
https://openqa.suse.de/tests/12373259
https://openqa.suse.de/tests/12373263
https://openqa.suse.de/tests/12373271
https://openqa.suse.de/tests/12373272
https://openqa.suse.de/tests/12373274
https://openqa.suse.de/tests/12373276
https://openqa.suse.de/tests/12373278
https://openqa.suse.de/tests/12373279
https://openqa.suse.de/tests/12373287
https://openqa.suse.de/tests/12373299
https://openqa.suse.de/tests/12373321
https://openqa.suse.de/tests/12373364
https://openqa.suse.de/tests/12373403
https://openqa.suse.de/tests/12373404
https://openqa.suse.de/tests/12373452
https://openqa.suse.de/tests/12373498
https://openqa.suse.de/tests/12373527
https://openqa.suse.de/tests/12373534
https://openqa.suse.de/tests/12373783
https://openqa.suse.de/tests/12373784
https://openqa.suse.de/tests/12373787
https://openqa.suse.de/tests/12373798
https://openqa.suse.de/tests/12373808
https://openqa.suse.de/tests/12373817
https://openqa.suse.de/tests/12373819
https://openqa.suse.de/tests/12373820
https://openqa.suse.de/tests/12373824
https://openqa.suse.de/tests/12373831
https://openqa.suse.de/tests/12373833
https://openqa.suse.de/tests/12373843
https://openqa.suse.de/tests/12373856
https://openqa.suse.de/tests/12373861
https://openqa.suse.de/tests/12373863
https://openqa.suse.de/tests/12373864
https://openqa.suse.de/tests/12373867
https://openqa.suse.de/tests/12373868
https://openqa.suse.de/tests/12373871
https://openqa.suse.de/tests/12373873
https://openqa.suse.de/tests/12373875
https://openqa.suse.de/tests/12373876
https://openqa.suse.de/tests/12373881
https://openqa.suse.de/tests/12373886
https://openqa.suse.de/tests/12373916
https://openqa.suse.de/tests/12374010
https://openqa.suse.de/tests/12374028
https://openqa.suse.de/tests/12374054
https://openqa.suse.de/tests/12374056
https://openqa.suse.de/tests/12374057
https://openqa.suse.de/tests/12374059
https://openqa.suse.de/tests/12374079
https://openqa.suse.de/tests/12374082
https://openqa.suse.de/tests/12374086
https://openqa.suse.de/tests/12374089
https://openqa.suse.de/tests/12374095
https://openqa.suse.de/tests/12374097
https://openqa.suse.de/tests/12374099
https://openqa.suse.de/tests/12374101
https://openqa.suse.de/tests/12374103
https://openqa.suse.de/tests/12374105
https://openqa.suse.de/tests/12374110
https://openqa.suse.de/tests/12374111
https://openqa.suse.de/tests/12374113
https://openqa.suse.de/tests/12374116
https://openqa.suse.de/tests/12374117
https://openqa.suse.de/tests/12374119
https://openqa.suse.de/tests/12374121
https://openqa.suse.de/tests/12374137
https://openqa.suse.de/tests/12374149
https://openqa.suse.de/tests/12374152
https://openqa.suse.de/tests/12374153
https://openqa.suse.de/tests/12374155
https://openqa.suse.de/tests/12374189
https://openqa.suse.de/tests/12374191
https://openqa.suse.de/tests/12374193
https://openqa.suse.de/tests/12374195
https://openqa.suse.de/tests/12374207
https://openqa.suse.de/tests/12374264
https://openqa.suse.de/tests/12374267
https://openqa.suse.de/tests/12374268
https://openqa.suse.de/tests/12374270
https://openqa.suse.de/tests/12374273
https://openqa.suse.de/tests/12374276
https://openqa.suse.de/tests/12374277
https://openqa.suse.de/tests/12374280
https://openqa.suse.de/tests/12374281
https://openqa.suse.de/tests/12374284
https://openqa.suse.de/tests/12374288
https://openqa.suse.de/tests/12374291
https://openqa.suse.de/tests/12374301
https://openqa.suse.de/tests/12374310
https://openqa.suse.de/tests/12374333
https://openqa.suse.de/tests/12374334
https://openqa.suse.de/tests/12374341
https://openqa.suse.de/tests/12374348
https://openqa.suse.de/tests/12374353
https://openqa.suse.de/tests/12374355
https://openqa.suse.de/tests/12374357
https://openqa.suse.de/tests/12374359
https://openqa.suse.de/tests/12374372
https://openqa.suse.de/tests/12374441
https://openqa.suse.de/tests/12374442
https://openqa.suse.de/tests/12374459
https://openqa.suse.de/tests/12374582
https://openqa.suse.de/tests/12374603
https://openqa.suse.de/tests/12374621
https://openqa.suse.de/tests/12374637
https://openqa.suse.de/tests/12374639
https://openqa.suse.de/tests/12374642
https://openqa.suse.de/tests/12374650
https://openqa.suse.de/tests/12374651
https://openqa.suse.de/tests/12374656
https://openqa.suse.de/tests/12374673
https://openqa.suse.de/tests/12374674
https://openqa.suse.de/tests/12374682
https://openqa.suse.de/tests/12374684
https://openqa.suse.de/tests/12374706
https://openqa.suse.de/tests/12374709
https://openqa.suse.de/tests/12374712
https://openqa.suse.de/tests/12374715
https://openqa.suse.de/tests/12374723
https://openqa.suse.de/tests/12374724
https://openqa.suse.de/tests/12374725
https://openqa.suse.de/tests/12374808
https://openqa.suse.de/tests/12374817
https://openqa.suse.de/tests/12374826
https://openqa.suse.de/tests/12374833
https://openqa.suse.de/tests/12374985
https://openqa.suse.de/tests/12375090
https://openqa.suse.de/tests/12375096
https://openqa.suse.de/tests/12375102
https://openqa.suse.de/tests/12375103
https://openqa.suse.de/tests/12375105
https://openqa.suse.de/tests/12375106
https://openqa.suse.de/tests/12375107
https://openqa.suse.de/tests/12375115
https://openqa.suse.de/tests/12375116
https://openqa.suse.de/tests/12375118
https://openqa.suse.de/tests/12375120
https://openqa.suse.de/tests/12375121
https://openqa.suse.de/tests/12375123
https://openqa.suse.de/tests/12375131
https://openqa.suse.de/tests/12375150
https://openqa.suse.de/tests/12375152
https://openqa.suse.de/tests/12375154
https://openqa.suse.de/tests/12375201
https://openqa.suse.de/tests/12375203
https://openqa.suse.de/tests/12375204
https://openqa.suse.de/tests/12375205
https://openqa.suse.de/tests/12375207
https://openqa.suse.de/tests/12375208
https://openqa.suse.de/tests/12375210
https://openqa.suse.de/tests/12375211
https://openqa.suse.de/tests/12375233
https://openqa.suse.de/tests/12375279
https://openqa.suse.de/tests/12375336
https://openqa.suse.de/tests/12375450
https://openqa.suse.de/tests/12375455
https://openqa.suse.de/tests/12375456
https://openqa.suse.de/tests/12375468
https://openqa.suse.de/tests/12375504
https://openqa.suse.de/tests/12375507
https://openqa.suse.de/tests/12376262
https://openqa.suse.de/tests/12376263
https://openqa.suse.de/tests/12376264
https://openqa.suse.de/tests/12376282
https://openqa.suse.de/tests/12376284
https://openqa.suse.de/tests/12376288
https://openqa.suse.de/tests/12376302
https://openqa.suse.de/tests/12376312
https://openqa.suse.de/tests/12376315
https://openqa.suse.de/tests/12376316
https://openqa.suse.de/tests/12376317
https://openqa.suse.de/tests/12376337
https://openqa.suse.de/tests/12376382
https://openqa.suse.de/tests/12376681
https://openqa.suse.de/tests/12376688
https://openqa.suse.de/tests/12376709
https://openqa.suse.de/tests/12376717
https://openqa.suse.de/tests/12376729
https://openqa.suse.de/tests/12376735
https://openqa.suse.de/tests/12376752
https://openqa.suse.de/tests/12376758
https://openqa.suse.de/tests/12376765
https://openqa.suse.de/tests/12376775
https://openqa.suse.de/tests/12376786
https://openqa.suse.de/tests/12376804
https://openqa.suse.de/tests/12376809
https://openqa.suse.de/tests/12376815
https://openqa.suse.de/tests/12376824
https://openqa.suse.de/tests/12376845
https://openqa.suse.de/tests/12376862
https://openqa.suse.de/tests/12376867
https://openqa.suse.de/tests/12376879
https://openqa.suse.de/tests/12376892
https://openqa.suse.de/tests/12376913
https://openqa.suse.de/tests/12376941
https://openqa.suse.de/tests/12376949
https://openqa.suse.de/tests/12376961
https://openqa.suse.de/tests/12376979
https://openqa.suse.de/tests/12376993
https://openqa.suse.de/tests/12377026
https://openqa.suse.de/tests/12377064
https://openqa.suse.de/tests/12377164
https://openqa.suse.de/tests/12377179
https://openqa.suse.de/tests/12377197
https://openqa.suse.de/tests/12377200
https://openqa.suse.de/tests/12377206
https://openqa.suse.de/tests/12377211
https://openqa.suse.de/tests/12377241
https://openqa.suse.de/tests/12377288
https://openqa.suse.de/tests/12377301
https://openqa.suse.de/tests/12377314
https://openqa.suse.de/tests/12377328
https://openqa.suse.de/tests/12377332
https://openqa.suse.de/tests/12377335
https://openqa.suse.de/tests/12377343
https://openqa.suse.de/tests/12377364
https://openqa.suse.de/tests/12377368
https://openqa.suse.de/tests/12377370
https://openqa.suse.de/tests/12377375
https://openqa.suse.de/tests/12377380
https://openqa.suse.de/tests/12377383
https://openqa.suse.de/tests/12377398
https://openqa.suse.de/tests/12377465
https://openqa.suse.de/tests/12377478
https://openqa.suse.de/tests/12377482
https://openqa.suse.de/tests/12377503
https://openqa.suse.de/tests/12377504
https://openqa.suse.de/tests/12377505
https://openqa.suse.de/tests/12377509
https://openqa.suse.de/tests/12377514
https://openqa.suse.de/tests/12377541
https://openqa.suse.de/tests/12377548
https://openqa.suse.de/tests/12377591
https://openqa.suse.de/tests/12377691
https://openqa.suse.de/tests/12377706
https://openqa.suse.de/tests/12377711
https://openqa.suse.de/tests/12377718
https://openqa.suse.de/tests/12377730
https://openqa.suse.de/tests/12377732
https://openqa.suse.de/tests/12377733
https://openqa.suse.de/tests/12377770
https://openqa.suse.de/tests/12377772
https://openqa.suse.de/tests/12377774
https://openqa.suse.de/tests/12377822
https://openqa.suse.de/tests/12377831
https://openqa.suse.de/tests/12377832
https://openqa.suse.de/tests/12377998
https://openqa.suse.de/tests/12378000
https://openqa.suse.de/tests/12378010
https://openqa.suse.de/tests/12378024
https://openqa.suse.de/tests/12378026
https://openqa.suse.de/tests/12378028
https://openqa.suse.de/tests/12378030
https://openqa.suse.de/tests/12378038
https://openqa.suse.de/tests/12378039
https://openqa.suse.de/tests/12378041
https://openqa.suse.de/tests/12378086
https://openqa.suse.de/tests/12378092
https://openqa.suse.de/tests/12378093
https://openqa.suse.de/tests/12378096
https://openqa.suse.de/tests/12378098
https://openqa.suse.de/tests/12378118
https://openqa.suse.de/tests/12378127
https://openqa.suse.de/tests/12378128
https://openqa.suse.de/tests/12378166
https://openqa.suse.de/tests/12378172
https://openqa.suse.de/tests/12378174
https://openqa.suse.de/tests/12378175
https://openqa.suse.de/tests/12378177
https://openqa.suse.de/tests/12378180
https://openqa.suse.de/tests/12378339
https://openqa.suse.de/tests/12378356
https://openqa.suse.de/tests/12378375
https://openqa.suse.de/tests/12378382
https://openqa.suse.de/tests/12378384
https://openqa.suse.de/tests/12378389
https://openqa.suse.de/tests/12378392
https://openqa.suse.de/tests/12378400
https://openqa.suse.de/tests/12378401
https://openqa.suse.de/tests/12378402
https://openqa.suse.de/tests/12378416
https://openqa.suse.de/tests/12378434
https://openqa.suse.de/tests/12378439
https://openqa.suse.de/tests/12378443
https://openqa.suse.de/tests/12378447
https://openqa.suse.de/tests/12378451
https://openqa.suse.de/tests/12378453
https://openqa.suse.de/tests/12378476
https://openqa.suse.de/tests/12378507
https://openqa.suse.de/tests/12378510
https://openqa.suse.de/tests/12378513
https://openqa.suse.de/tests/12378518
https://openqa.suse.de/tests/12378522
https://openqa.suse.de/tests/12378525
https://openqa.suse.de/tests/12378556
https://openqa.suse.de/tests/12378561
https://openqa.suse.de/tests/12378573
https://openqa.suse.de/tests/12378584
https://openqa.suse.de/tests/12378586
https://openqa.suse.de/tests/12378587
https://openqa.suse.de/tests/12378590
https://openqa.suse.de/tests/12378592
https://openqa.suse.de/tests/12378595
https://openqa.suse.de/tests/12378597
https://openqa.suse.de/tests/12378758
https://openqa.suse.de/tests/12378762
https://openqa.suse.de/tests/12378792
https://openqa.suse.de/tests/12378793
https://openqa.suse.de/tests/12378794
https://openqa.suse.de/tests/12378807
https://openqa.suse.de/tests/12378844
https://openqa.suse.de/tests/12378864
https://openqa.suse.de/tests/12378871
https://openqa.suse.de/tests/12378887
https://openqa.suse.de/tests/12378917
https://openqa.suse.de/tests/12378919
https://openqa.suse.de/tests/12378920
https://openqa.suse.de/tests/12378928
https://openqa.suse.de/tests/12378950
https://openqa.suse.de/tests/12378953
https://openqa.suse.de/tests/12378968
https://openqa.suse.de/tests/12378972
https://openqa.suse.de/tests/12378973
https://openqa.suse.de/tests/12378979
https://openqa.suse.de/tests/12378980
https://openqa.suse.de/tests/12378987
https://openqa.suse.de/tests/12378990
https://openqa.suse.de/tests/12378992
https://openqa.suse.de/tests/12378997
https://openqa.suse.de/tests/12379000
https://openqa.suse.de/tests/12379006
https://openqa.suse.de/tests/12379038
https://openqa.suse.de/tests/12379043
https://openqa.suse.de/tests/12379127
https://openqa.suse.de/tests/12379135
https://openqa.suse.de/tests/12379139
https://openqa.suse.de/tests/12379141
https://openqa.suse.de/tests/12379143
https://openqa.suse.de/tests/12379145
https://openqa.suse.de/tests/12379148
https://openqa.suse.de/tests/12379155
https://openqa.suse.de/tests/12379156
https://openqa.suse.de/tests/12379160
https://openqa.suse.de/tests/12379164
https://openqa.suse.de/tests/12379165
https://openqa.suse.de/tests/12379167
https://openqa.suse.de/tests/12379170
https://openqa.suse.de/tests/12379172
https://openqa.suse.de/tests/12379174
https://openqa.suse.de/tests/12379175
https://openqa.suse.de/tests/12379179
https://openqa.suse.de/tests/12379181
https://openqa.suse.de/tests/12379187
https://openqa.suse.de/tests/12379191
https://openqa.suse.de/tests/12379194
https://openqa.suse.de/tests/12379198
https://openqa.suse.de/tests/12379201
https://openqa.suse.de/tests/12379203
https://openqa.suse.de/tests/12379206
https://openqa.suse.de/tests/12379207
https://openqa.suse.de/tests/12379221
https://openqa.suse.de/tests/12379223
https://openqa.suse.de/tests/12379225
https://openqa.suse.de/tests/12379227
https://openqa.suse.de/tests/12379232
https://openqa.suse.de/tests/12379257
https://openqa.suse.de/tests/12379261
https://openqa.suse.de/tests/12379275
https://openqa.suse.de/tests/12379285
https://openqa.suse.de/tests/12379289
https://openqa.suse.de/tests/12379301
https://openqa.suse.de/tests/12379308
https://openqa.suse.de/tests/12379318
https://openqa.suse.de/tests/12379319
https://openqa.suse.de/tests/12379345
https://openqa.suse.de/tests/12379412
https://openqa.suse.de/tests/12379414
https://openqa.suse.de/tests/12379429
https://openqa.suse.de/tests/12379431
https://openqa.suse.de/tests/12379466
https://openqa.suse.de/tests/12379468
https://openqa.suse.de/tests/12379710
https://openqa.suse.de/tests/12379719
https://openqa.suse.de/tests/12379720
https://openqa.suse.de/tests/12379733
https://openqa.suse.de/tests/12379740
https://openqa.suse.de/tests/12379753
https://openqa.suse.de/tests/12379756
https://openqa.suse.de/tests/12379757
https://openqa.suse.de/tests/12379758
https://openqa.suse.de/tests/12379759
https://openqa.suse.de/tests/12379773
https://openqa.suse.de/tests/12380429
https://openqa.suse.de/tests/12380431
https://openqa.suse.de/tests/12380590
https://openqa.suse.de/tests/12380591
https://openqa.suse.de/tests/12380593
https://openqa.suse.de/tests/12380609
https://openqa.suse.de/tests/12380613
https://openqa.suse.de/tests/12380627
https://openqa.suse.de/tests/12380632
https://openqa.suse.de/tests/12380634
https://openqa.suse.de/tests/12380674
https://openqa.suse.de/tests/12380709
https://openqa.suse.de/tests/12380719
https://openqa.suse.de/tests/12380799
https://openqa.suse.de/tests/12381128
https://openqa.suse.de/tests/12381140
https://openqa.suse.de/tests/12381145
https://openqa.suse.de/tests/12381155
https://openqa.suse.de/tests/12381165
https://openqa.suse.de/tests/12381166
https://openqa.suse.de/tests/12381170
https://openqa.suse.de/tests/12381173
https://openqa.suse.de/tests/12381176
https://openqa.suse.de/tests/12381276
https://openqa.suse.de/tests/12381541
https://openqa.suse.de/tests/12381596
https://openqa.suse.de/tests/12381634
https://openqa.suse.de/tests/12381637
https://openqa.suse.de/tests/12381643
https://openqa.suse.de/tests/12381742
https://openqa.suse.de/tests/12381755
https://openqa.suse.de/tests/12381759
https://openqa.suse.de/tests/12381790
Actions #25

Updated by nicksinger 7 months ago

I used

for i in $(ssh openqa.suse.de "sudo -u geekotest psql -t --command=\"select distinct j.id from jobs j join job_settings js on j.id = js.job_id join workers w on j.assigned_worker_id = w.id where j.t_created >= '2023-10-04' and (result = 'failed' or test = 'incomplete') and arch='x86_64' and host ~ '^worker3[0-6]+' and test !~ ':investigate:';\" openqa"); do echo openqa-client --host openqa.suse.de jobs/$i/restart post; done

to restart them after reverting https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/635 and https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/630
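Note that the loop as quoted prefixes the client call with echo, so it only prints the restart commands. A minimal sketch of that dry-run step as a reusable helper, assuming job ids arrive one per line in psql -t style (leading whitespace, possibly blank lines); the function name print_restart_cmds is made up for illustration:

```shell
# print_restart_cmds HOST
# Hypothetical helper: reads job ids from stdin and prints one
# "openqa-client ... restart post" command per id (dry run, nothing is executed).
print_restart_cmds() {
    prc_host=$1
    # read strips the leading whitespace that psql -t puts in front of each id
    while read -r prc_id; do
        case $prc_id in
            ''|*[!0-9]*) continue ;;  # skip blank or non-numeric lines
        esac
        echo "openqa-client --host $prc_host jobs/$prc_id/restart post"
    done
}
```

The printed commands can be reviewed first and then piped into sh once they look right.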

Actions #26

Updated by livdywan 7 months ago

Kinda seems like the issue described in #135524-20 came back here:

sudo salt -C 'worker3*' cmd.run 'sysctl -a | grep net.ipv..conf.br..forwarding | grep -v v6'
worker34.oqa.prg2.suse.org:
    net.ipv4.conf.br1.forwarding = 0
worker32.oqa.prg2.suse.org:
    net.ipv4.conf.br1.forwarding = 0
worker36.oqa.prg2.suse.org:
    net.ipv4.conf.br1.forwarding = 1
worker37.oqa.prg2.suse.org:
    net.ipv4.conf.br1.forwarding = 1
worker33.oqa.prg2.suse.org:
    net.ipv4.conf.br1.forwarding = 0
worker31.oqa.prg2.suse.org:
    net.ipv4.conf.br1.forwarding = 0
worker35.oqa.prg2.suse.org:
    net.ipv4.conf.br1.forwarding = 1
worker39.oqa.prg2.suse.org:
    net.ipv4.conf.br1.forwarding = 1
worker30.oqa.prg2.suse.org:
    net.ipv4.conf.br1.forwarding = 1
worker38.oqa.prg2.suse.org:
    net.ipv4.conf.br1.forwarding = 1

It's not consistently enabled. Trying to find the relevant change in salt.
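To list only the affected hosts, the salt output above can be filtered for entries reporting forwarding = 0. A minimal sketch, assuming the output format shown above (a host line ending in a colon, followed by indented sysctl lines) and that salt emits plain text when piped; the function name hosts_without_forwarding is made up for illustration:

```shell
# hosts_without_forwarding
# Hypothetical helper: reads salt cmd.run output from stdin and prints each
# host that reports at least one "forwarding = 0" line, sorted and de-duplicated.
hosts_without_forwarding() {
    awk '
        # unindented line ending in ":" is a host header, remember it without the colon
        /^[^ ].*:$/ { host = substr($0, 1, length($0) - 1); next }
        # indented sysctl line with forwarding disabled belongs to the last host seen
        /forwarding = 0$/ { print host }
    ' | sort -u
}

# Possible usage (not run here):
# sudo salt -C 'worker3*' cmd.run 'sysctl -a | grep net.ipv..conf.br..forwarding | grep -v v6' | hosts_without_forwarding
```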

Actions #27

Updated by livdywan 7 months ago

It's not consistently enabled. Trying to find the relevant change in salt.

https://gitlab.suse.de/openqa/salt-states-openqa/-/merge_requests/987#note_539351 mentions eth0 and indeed that's enabled on all of these machines, even though the change in trusted.xml concerns br1. Maybe we need both, and br1 was only manually enabled on machines so far? I'm not clear why the comment seems to verify a different interface.

Actions #28

Updated by livdywan 7 months ago

Going by #135524#note-15, and since apparently only workers where forwarding is not enabled on br1 fail (unable to resolve scc.suse.com and such), I'm assuming that adding the corresponding sysctl.present states is needed in addition to the firewall configuration.

Actions #29

Updated by livdywan 7 months ago

  • Due date changed from 2023-10-05 to 2023-10-13

Since we need to ensure the workers do work, even if these are issues that existed before, I'm bumping the due date to ensure attempted fixes can be reviewed and more testing can be conducted before taking them into production again.

Actions #30

Updated by livdywan 7 months ago

  • Related to action #136013: Ensure IP forwarding is persistent for multi-machine tests also in our salt recipes size:M added
Actions #31

Updated by livdywan 7 months ago

livdywan wrote in #note-26:

Kinda seems like the issue described in #135524-20 came back here:

sudo salt -C 'worker3*' cmd.run 'sysctl -a | grep net.ipv..conf.br..forwarding | grep -v v6'
worker34.oqa.prg2.suse.org:
    net.ipv4.conf.br1.forwarding = 0
worker32.oqa.prg2.suse.org:
    net.ipv4.conf.br1.forwarding = 0

Apparently something or someone enabled forwarding on all of these machines?

Actions #32

Updated by okurz 7 months ago

I suggest rebooting all worker3* and checking forwarding again, e.g.

sudo salt -C 'worker3*' cmd.run 'reboot' && until ping -c1 worker31.oqa.prg2.suse.org; do sleep 5; done && sudo salt -C 'worker3*' cmd.run 'sysctl -a | grep net.ipv..conf.br..forwarding | grep -v v6'
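
The wait between reboot and re-check can also be sketched with a bounded retry helper, so the loop gives up if a host never comes back. This is only a sketch: the function name wait_until and the retry numbers in the usage comment are assumptions, and a host may still answer pings for a moment right after the reboot was issued.

```shell
# wait_until MAX_TRIES DELAY CMD...
# Hypothetical helper: runs CMD until it succeeds, sleeping DELAY seconds
# between attempts. Returns 1 after MAX_TRIES failed attempts.
wait_until() {
    wu_max=$1
    wu_delay=$2
    shift 2
    wu_tries=0
    until "$@"; do
        wu_tries=$((wu_tries + 1))
        if [ "$wu_tries" -ge "$wu_max" ]; then
            return 1
        fi
        sleep "$wu_delay"
    done
}

# Possible usage (not run here):
# sudo salt -C 'worker3*' cmd.run 'reboot'
# wait_until 60 10 ping -c1 worker31.oqa.prg2.suse.org \
#   && sudo salt -C 'worker3*' cmd.run 'sysctl -a | grep net.ipv..conf.br..forwarding | grep -v v6'
```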
Actions #33

Updated by livdywan 7 months ago

okurz wrote in #note-32:

I suggest rebooting all worker3* and checking forwarding again, e.g.

sudo salt -C 'worker3*' cmd.run 'reboot' && until ping -c1 worker31.oqa.prg2.suse.org; do sleep 5; done && sudo salt -C 'worker3*' cmd.run 'sysctl -a | grep net.ipv..conf.br..forwarding | grep -v v6'

Apparently whatever it is, it's persistent.

Test jobs are looking good, suggesting the missing forwarding was indeed the cause of the failures.

Provided no surprises come up, I would once again proceed with enabling multi-machine (mm) tests on those workers in production.

Actions #34

Updated by livdywan 7 months ago

Test jobs are looking good, suggesting the missing forwarding was indeed the cause of the failures.

There were failures. All of them fail like so:

# Test died: command 'tcpdump -ni any net -c 20 10.0.2.102 > check.log' timed out at /usr/lib/os-autoinst/testapi.pm line 926.
testapi::assert_script_run("tcpdump -ni any net -c 20 10.0.2.102 > check.log", 300) called at sle/lib/console/ovs_utils.pm line 36
Actions #35

Updated by okurz 7 months ago

so? What's your plan?

Actions #36

Updated by livdywan 7 months ago

okurz wrote in #note-35:

so? What's your plan?

Another batch because I'm suspicious of these failures having occurred at roughly the same time.

Actions #37

Updated by livdywan 7 months ago

livdywan wrote in #note-36:

okurz wrote in #note-35:

so? What's your plan?

Another batch because I'm suspicious of these failures having occurred at roughly the same time.

Apparently it's all worker31 in the previous batch and the latest batch.

Actions #38

Updated by livdywan 7 months ago

livdywan wrote in #note-37:

livdywan wrote in #note-36:

okurz wrote in #note-35:

so? What's your plan?

Another batch because I'm suspicious of these failures having occurred at roughly the same time.

Apparently it's all worker31 in the previous batch and the latest batch.

Indeed, it's worker31 every single time in batch g and batch h, too. Presumably the others, i.e. worker32-36, can be enabled in production.

Actions #39

Updated by okurz 7 months ago

https://gitlab.suse.de/openqa/salt-pillars-openqa/-/merge_requests/648 merged, please monitor impact on production, i.e. find passed jobs with the production worker classes.

Actions #40

Updated by livdywan 7 months ago

  • Copied to action #137756: Re-enable worker31 for multi-machine tests in production auto_review:"tcpdump.+check.log.+timed out at" added
Actions #41

Updated by livdywan 7 months ago

  • Status changed from Feedback to Resolved

I reviewed various jobs and checked that there are multiple successful jobs with tap on each worker. Here are some examples:

The workers seem to take jobs in general, and regarding the primary concern of the ticket I think the load is looking great!

Actions #42

Updated by okurz 7 months ago

  • Due date deleted (2023-10-13)