action #62153

closed

[hpc] test fails in slurm_master

Added by sebchlad almost 5 years ago. Updated about 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Bugs in existing tests
Target version: QE Kernel - QE Kernel Done
Start date: 2020-01-15
Due date:
% Done: 0%
Estimated time:
Difficulty:

Description

Observation

The openQA test in scenario sle-15-SP2-Online-aarch64-hpc_GAMMA_slurm_master_db@aarch64 fails in slurm_master.

Test suite description

Slurm accounting tests with a database configured and an NFS shared folder provided. Two controllers (ctls) and multiple compute nodes. Maintainer: schlad

Reproducible

Fails since (at least) Build 120.1

Expected result

Last good: 105.4 (or more recent)

Further details

Always latest result in this scenario: latest

Actions #1

Updated by sebchlad almost 5 years ago

  • Subject changed from test fails in slurm_master to [hpc] test fails in slurm_master
Actions #2

Updated by sebchlad almost 5 years ago

This might be a bug, judging by the following slurmctld log:
[2020-01-20T05:38:33.455] debug: Log file re-opened
[2020-01-20T05:38:33.463] debug: sched: slurmctld starting
[2020-01-20T05:38:33.464] debug: creating clustername file: /shared/slurm//clustername
[2020-01-20T05:38:33.465] error: chdir(/var/log): Permission denied
[2020-01-20T05:38:33.465] slurmctld version 18.08.8 started on cluster linux
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/crypto_munge.so
[2020-01-20T05:38:33.465] Munge cryptographic signature plugin loaded
[2020-01-20T05:38:33.465] debug3: Success.
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/select_linear.so
[2020-01-20T05:38:33.465] debug3: Success.
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/preempt_none.so
[2020-01-20T05:38:33.465] preempt/none loaded
[2020-01-20T05:38:33.465] debug3: Success.
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/checkpoint_none.so
[2020-01-20T05:38:33.465] debug3: Success.
[2020-01-20T05:38:33.465] debug: Checkpoint plugin loaded: checkpoint/none
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_energy_none.so
[2020-01-20T05:38:33.465] debug: AcctGatherEnergy NONE plugin loaded
[2020-01-20T05:38:33.465] debug3: Success.
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_profile_none.so
[2020-01-20T05:38:33.465] debug: AcctGatherProfile NONE plugin loaded
[2020-01-20T05:38:33.465] debug3: Success.
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_interconnect_none.so
[2020-01-20T05:38:33.465] debug: AcctGatherInterconnect NONE plugin loaded
[2020-01-20T05:38:33.465] debug3: Success.
[2020-01-20T05:38:33.465] debug3: Trying to load plugin /usr/lib64/slurm/acct_gather_filesystem_none.so
[2020-01-20T05:38:33.466] debug: AcctGatherFilesystem NONE plugin loaded
[2020-01-20T05:38:33.466] debug3: Success.
[2020-01-20T05:38:33.466] debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
[2020-01-20T05:38:33.466] debug3: Trying to load plugin /usr/lib64/slurm/jobacct_gather_linux.so
[2020-01-20T05:38:33.466] debug: Job accounting gather LINUX plugin loaded
[2020-01-20T05:38:33.466] debug3: Success.
[2020-01-20T05:38:33.466] WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
[2020-01-20T05:38:33.466] debug3: Trying to load plugin /usr/lib64/slurm/ext_sensors_none.so
[2020-01-20T05:38:33.466] ExtSensors NONE plugin loaded
[2020-01-20T05:38:33.466] debug3: Success.
[2020-01-20T05:38:33.466] debug3: Trying to load plugin /usr/lib64/slurm/switch_none.so
[2020-01-20T05:38:33.466] debug: switch NONE plugin loaded
[2020-01-20T05:38:33.466] debug3: Success.
[2020-01-20T05:38:33.466] debug: power_save module disabled, SuspendTime < 0
[2020-01-20T05:38:33.471] debug: Requesting control from backup controller master-node01
[2020-01-20T05:38:33.472] debug2: slurm_connect failed: Connection refused
[2020-01-20T05:38:33.472] debug2: Error connecting slurm stream socket at 10.0.2.18:6817: Connection refused
[2020-01-20T05:38:33.472] error: _shutdown_bu_thread:send/recv master-node01: Connection refused
[2020-01-20T05:38:33.472] debug3: Trying to load plugin /usr/lib64/slurm/accounting_storage_slurmdbd.so
[2020-01-20T05:38:33.472] Accounting storage SLURMDBD plugin loaded
[2020-01-20T05:38:33.472] debug3: Success.
[2020-01-20T05:38:33.473] debug2: slurm_connect failed: Connection refused
[2020-01-20T05:38:33.473] debug2: Error connecting slurm stream socket at 10.0.2.16:20088: Connection refused
[2020-01-20T05:38:33.473] error: slurm_persist_conn_open_without_init: failed to open persistent connection to slave-node02:20088: Connection refused
[2020-01-20T05:38:33.473] error: slurmdbd: Sending PersistInit msg: Connection refused
[2020-01-20T05:38:33.473] debug4: slurmdbd: There is no state save file to open by name /shared/slurm//dbd.messages
[2020-01-20T05:38:33.473] debug: Association database appears down, reading from state file.
[2020-01-20T05:38:33.473] debug: create_mmap_buf: Failed to open file /shared/slurm//last_tres, No such file or directory
[2020-01-20T05:38:33.473] debug2: No last_tres file (/shared/slurm//last_tres) to recover
[2020-01-20T05:38:33.473] debug: create_mmap_buf: Failed to open file /shared/slurm//assoc_mgr_state, No such file or directory
[2020-01-20T05:38:33.473] debug2: No association state file (/shared/slurm//assoc_mgr_state) to recover
[2020-01-20T05:38:33.473] fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files.
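
As a rough aid for reading logs like the one above, the following Python sketch keeps only the `error:` and `fatal:` lines so that the failure chain (unreachable backup controller, unreachable slurmdbd, missing state files) stands out from the debug noise. It assumes every line follows the `[timestamp] level: message` layout seen in this excerpt. The names `LINE_RE`, `severe_lines` and `SAMPLE` are hypothetical helpers introduced for illustration only; they are not part of Slurm or openQA.

```python
import re

# Assumption: slurmctld log lines look like
#   [2020-01-20T05:38:33.465] error: chdir(/var/log): Permission denied
# Only lines whose message starts with "error:" or "fatal:" are matched.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>error|fatal):\s+(?P<msg>.*)$"
)


def severe_lines(log_text):
    """Return a list of (timestamp, level, message) for error/fatal lines."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            hits.append((match.group("ts"), match.group("level"), match.group("msg")))
    return hits


# A few lines copied verbatim from the log in this comment.
SAMPLE = """\
[2020-01-20T05:38:33.465] error: chdir(/var/log): Permission denied
[2020-01-20T05:38:33.465] slurmctld version 18.08.8 started on cluster linux
[2020-01-20T05:38:33.472] debug2: slurm_connect failed: Connection refused
[2020-01-20T05:38:33.473] error: slurmdbd: Sending PersistInit msg: Connection refused
[2020-01-20T05:38:33.473] fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files.
"""

if __name__ == "__main__":
    for ts, level, msg in severe_lines(SAMPLE):
        print(f"{ts} {level}: {msg}")
```

Run on the sample, the filter reports the two `error:` lines and the final `fatal:` line and skips the informational and debug lines. Matching on the literal prefixes rather than on any word followed by a colon avoids misreading messages such as `preempt/none loaded` as having a level.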

Actions #3

Updated by sebchlad almost 5 years ago

Investigating such problems would be much easier if the test failed on every node and then collected the logs from each one, so I will try to implement that first.

Actions #4

Updated by sebchlad almost 5 years ago

  • Status changed from In Progress to Resolved
  • Target version changed from 445 to 457

Resolved.

Actions #5

Updated by pcervinka about 4 years ago

  • Target version changed from 457 to QE Kernel Done