action #62153 (closed): [hpc] test fails in slurm_master
Added by sebchlad almost 5 years ago. Updated about 4 years ago.
Description
Observation
openQA test in scenario sle-15-SP2-Online-aarch64-hpc_GAMMA_slurm_master_db@aarch64 fails in slurm_master
Test suite description
Slurm accounting tests with a database configured and an NFS shared folder provided; two controllers and multiple compute nodes. Maintainer: schlad
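For orientation, that topology could look roughly like this in slurm.conf; the primary controller name and the accounting host/port are assumptions pieced together from the log further down, not the test's actual configuration:

ClusterName=linux                               # cluster name seen in the log
SlurmctldHost=master-node00                     # hypothetical primary controller
SlurmctldHost=master-node01                     # backup controller named in the log
StateSaveLocation=/shared/slurm/                # state directory on the NFS share
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slave-node02              # node running slurmdbd (from the log)
AccountingStoragePort=20088                     # non-default slurmdbd port (from the log)

With SlurmctldHost listed twice, the second entry acts as the backup controller that the log's "Requesting control from backup controller master-node01" line refers to.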
Reproducible
Fails since (at least) Build 120.1
Expected result
Last good: 105.4 (or more recent)
Further details
Always latest result in this scenario: latest
Updated by sebchlad almost 5 years ago
- Subject changed from test fails in slurm_master to [hpc] test fails in slurm_master
Updated by sebchlad almost 5 years ago
This might be a bug, looking at the slurmctld log below (see also the note after it):
[2020-01-20T05:38:33.455] debug: Log file re-opened
[2020-01-20T05:38:33.463] debug: sched: slurmctld starting
[2020-01-20T05:38:33.464] debug: creating clustername file: /shared/slurm//clustername
[2020-01-20T05:38:33.465] error: chdir(/var/log): Permission denied
[2020-01-20T05:38:33.465] slurmctld version 18.08.8 started on cluster linux
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/crypto_munge.so
[2020-01-20T05:38:33.465] Munge cryptographic signature plugin loaded
[2020-01-20T05:38:33.465] debug3: Success.
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/select_linear.so
[2020-01-20T05:38:33.465] debug3: Success.
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/preempt_none.so
[2020-01-20T05:38:33.465] preempt/none loaded
[2020-01-20T05:38:33.465] debug3: Success.
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/checkpoint_none.so
[2020-01-20T05:38:33.465] debug3: Success.
[2020-01-20T05:38:33.465] debug: Checkpoint plugin loaded: checkpoint/none
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_energy_none.so
[2020-01-20T05:38:33.465] debug: AcctGatherEnergy NONE plugin loaded
[2020-01-20T05:38:33.465] debug3: Success.
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_profile_none.so
[2020-01-20T05:38:33.465] debug: AcctGatherProfile NONE plugin loaded
[2020-01-20T05:38:33.465] debug3: Success.
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_interconnect_none.so
[2020-01-20T05:38:33.465] debug: AcctGatherInterconnect NONE plugin loaded
[2020-01-20T05:38:33.465] debug3: Success.
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_filesystem_none.so
[2020-01-20T05:38:33.466] debug: AcctGatherFilesystem NONE plugin loaded
[2020-01-20T05:38:33.466] debug3: Success.
[2020-01-20T05:38:33.466] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2020-01-20T05:38:33.466] debug3: Trying to load plugin /usr/lib64/slurm/jobacct_gather_linux.so
[2020-01-20T05:38:33.466] debug: Job accounting gather LINUX plugin loaded
[2020-01-20T05:38:33.466] debug3: Success.
[2020-01-20T05:38:33.466] WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
[2020-01-20T05:38:33.466] debug3: Trying to load plugin /usr/lib64/slurm/ext_sensors_none.so
[2020-01-20T05:38:33.466] ExtSensors NONE plugin loaded
[2020-01-20T05:38:33.466] debug3: Success.
[2020-01-20T05:38:33.466] debug3: Trying to load plugin /usr/lib64/slurm/switch_none.so
[2020-01-20T05:38:33.466] debug: switch NONE plugin loaded
[2020-01-20T05:38:33.466] debug3: Success.
[2020-01-20T05:38:33.466] debug: power_save module disabled, SuspendTime < 0
[2020-01-20T05:38:33.471] debug: Requesting control from backup controller master-node01
[2020-01-20T05:38:33.472] debug2: slurm_connect failed: Connection refused
[2020-01-20T05:38:33.472] debug2: Error connecting slurm stream socket at 10.0.2.18:6817: Connection refused
[2020-01-20T05:38:33.472] error: _shutdown_bu_thread:send/recv master-node01: Connection refused
[2020-01-20T05:38:33.472] debug3: Trying to load plugin /usr/lib64/slurm/accounting_storage_slurmdbd.so
[2020-01-20T05:38:33.472] Accounting storage SLURMDBD plugin loaded
[2020-01-20T05:38:33.472] debug3: Success.
[2020-01-20T05:38:33.473] debug2: slurm_connect failed: Connection refused
[2020-01-20T05:38:33.473] debug2: Error connecting slurm stream socket at 10.0.2.16:20088: Connection refused
[2020-01-20T05:38:33.473] error: slurm_persist_conn_open_without_init: failed to open persistent connection to slave-node02:20088: Connection refused
[2020-01-20T05:38:33.473] error: slurmdbd: Sending PersistInit msg: Connection refused
[2020-01-20T05:38:33.473] debug4: slurmdbd: There is no state save file to open by name /shared/slurm//dbd.messages
[2020-01-20T05:38:33.473] debug: Association database appears down, reading from state file.
[2020-01-20T05:38:33.473] debug: create_mmap_buf: Failed to open file /shared/slurm//last_tres, No such file or directory
[2020-01-20T05:38:33.473] debug2: No last_tres file (/shared/slurm//last_tres) to recover
[2020-01-20T05:38:33.473] debug: create_mmap_buf: Failed to open file /shared/slurm//assoc_mgr_state, No such file or directory
[2020-01-20T05:38:33.473] debug2: No association state file (/shared/slurm//assoc_mgr_state) to recover
[2020-01-20T05:38:33.473] fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files.
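The fatal line matches the two conditions it names: slurmdbd on slave-node02:20088 was refusing connections, and /shared/slurm/ held no state files to fall back on. A minimal sketch of a guard that waits for slurmdbd before starting the controller, assuming the host and port from this log and the stock slurmctld systemd unit:

#!/usr/bin/env python3
"""Sketch only: block until slurmdbd accepts TCP connections, then start
slurmctld, so the controller can fetch TRES from the accounting database
instead of dying with the fatal error above. Host and port are taken from
the log in this ticket; adjust to the real AccountingStorageHost/Port."""
import socket
import subprocess
import sys
import time

DBD_HOST = "slave-node02"  # assumption: node running slurmdbd in this log
DBD_PORT = 20088           # assumption: slurmdbd port seen in this log
TIMEOUT = 180              # give slurmdbd up to three minutes to come up


def dbd_reachable(host, port):
    """Return True if a TCP connection to slurmdbd succeeds."""
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False


deadline = time.time() + TIMEOUT
while not dbd_reachable(DBD_HOST, DBD_PORT):
    if time.time() > deadline:
        sys.exit("slurmdbd still unreachable, not starting slurmctld")
    time.sleep(5)

# Only now start the controller via its standard systemd unit.
subprocess.run(["systemctl", "start", "slurmctld"], check=True)

A unit-level After=/Requires= ordering would achieve the same effect if both daemons ran on one host; here slurmdbd runs on a different node, so a network-level check is needed.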
Updated by sebchlad almost 5 years ago
Investigating such problems would be much easier if we failed the test on each node and then collected the logs from each node, so I will try to do that first (roughly as sketched below).
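A rough illustration of that idea, not how the openQA test actually does it; node names, log paths and passwordless root ssh from the collecting host are all assumptions:

#!/usr/bin/env python3
"""Sketch: after a failure, pull the relevant slurm logs from every node so
the state of the whole cluster can be inspected in one place."""
import subprocess
from pathlib import Path

NODES = ["master-node00", "master-node01", "slave-node02"]  # assumed cluster members
LOGS = ["/var/log/slurmctld.log", "/var/log/slurmdbd.log", "/var/log/slurmd.log"]


def collect_logs(dest="collected-logs"):
    """Copy each known log from each node into dest/, one file per node and log."""
    Path(dest).mkdir(exist_ok=True)
    for node in NODES:
        for log in LOGS:
            target = Path(dest) / f"{node}{log.replace('/', '_')}"
            # check=False: a node that lacks a given log (e.g. no slurmdbd on
            # the controllers) is skipped instead of aborting the collection
            subprocess.run(["scp", f"root@{node}:{log}", str(target)], check=False)


if __name__ == "__main__":
    collect_logs()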
Updated by sebchlad almost 5 years ago
- Status changed from In Progress to Resolved
- Target version changed from 445 to 457
resolved
Updated by pcervinka about 4 years ago
- Target version changed from 457 to QE Kernel Done