
Error: slurm_receive_msg: Zero Bytes were transmitted or received

Cause: MUNGE authentication is failing
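
A quick way to confirm the MUNGE diagnosis is to encode a credential on one host and decode it on another. The sketch below uses the standard munge and unmunge tools; the helper name check_munge and the node name are hypothetical, and it assumes passwordless ssh to the node.

```shell
# Hypothetical helper: verify MUNGE locally and against one remote node.
check_munge() {
    node="$1"
    # Encode and decode a credential on this host.
    if ! munge -n | unmunge >/dev/null; then
        echo "local MUNGE check failed" >&2
        return 1
    fi
    # Decode a locally created credential on the remote node. This typically
    # fails when the munge keys differ or the clocks are too far apart.
    if ! munge -n | ssh "$node" unmunge >/dev/null; then
        echo "MUNGE check failed on $node" >&2
        return 1
    fi
    echo "MUNGE ok on $node"
}
```

Usage would be something like `check_munge node01`. If the remote step fails, the usual things to compare are the munge key on both hosts, the system time, and whether munged is running.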

Node down

scontrol update NodeName=NODENAME State=RESUME
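
Before resuming a node it usually helps to see why Slurm marked it down. A minimal sketch, with hypothetical helper names, that wraps the inspection commands and the RESUME command from above:

```shell
# Hypothetical helpers around the standard sinfo/scontrol commands.
why_down() {
    # List down or drained nodes together with the recorded reason.
    sinfo -R
    # Show the full record of one node, including its State and Reason fields.
    scontrol show node "$1"
}

resume_node() {
    # Return the node to service once the underlying problem is fixed.
    scontrol update NodeName="$1" State=RESUME
}
```

Typical use: `why_down node01`, fix the cause, then `resume_node node01`.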


  • slurmctld server down → nodes (and jobs) keep running
  • mysql daemon down → no new jobs can be submitted, but running jobs are not aborted

Partitions (queues) and group restrictions

  • Partitions can be restricted to groups (or accounts) in slurm.conf → users who are not members of the group do not see the partition at all
  • If a user tries to submit to a queue they are not cleared for, they get a message that they are not permitted to use that queue
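
As a sketch of what such a restriction can look like: the partition, node, and group names below are hypothetical. The commented line shows the slurm.conf form; the function applies the same restriction to a running system with scontrol, which does not persist across a slurmctld restart unless slurm.conf is updated as well.

```shell
# slurm.conf form (all names are assumptions for illustration):
#   PartitionName=gpu Nodes=gpu[01-04] AllowGroups=gpuusers State=UP

# Hypothetical helper: restrict a partition to one group at runtime.
restrict_partition() {
    part="$1"
    group="$2"
    scontrol update PartitionName="$part" AllowGroups="$group"
    # Print the partition record to confirm the new AllowGroups value.
    scontrol show partition "$part"
}
```

Usage: `restrict_partition gpu gpuusers`.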


  • ACLs: Benutzer - Account - Partition
  • check MPI
  • → routing queue
  • ACLs: miid-db sync
  • test GPU job
  • job preemption
  • SLURM Energy Accounting Plugin


  • use priority FIFO scheduler with backfill
  • sort queued jobs by priority using 5 factors: age, fair-share, job size, partition, QOS
  • select nodes using consumable resources with CORES and MEMORY
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
  • manage a node's resources via cgroups
  • track processes with cgroups
  • GPU with general resources (gres)
  • preemption for bulldozer nodes: jobs scheduled on partition 'amd' may preempt running jobs
  • accounting via slurm db
  • Consumable Resource Allocation Plugin: select/cons_res
  • Multi-factor Job Priority Plugin
# Activate the Multi-factor Job Priority Plugin with decay
# 2 week half-life
# The job's age factor reaches 1.0 after waiting in the
# queue for 2 weeks.
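
The comments above read like the lead-ins to slurm.conf settings such as PriorityType=priority/multifactor, PriorityDecayHalfLife=14-0 and PriorityMaxAge=14-0; that mapping is an assumption, since the notes do not show the values. One way to check what a running cluster actually uses is to filter the live configuration (the helper name is hypothetical):

```shell
# Hypothetical helper: print only the Priority* settings of the running config.
# Assumed slurm.conf lines behind the comments above (not confirmed by the notes):
#   PriorityType=priority/multifactor
#   PriorityDecayHalfLife=14-0
#   PriorityMaxAge=14-0
show_priority_config() {
    scontrol show config | grep '^Priority'
}
```

For pending jobs, `sprio -l` lists the individual priority factors and `sprio -w` the configured weights.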


Incidental findings in slurm.conf:
  • A port range must be configured, otherwise Slurm's own MPI support does not work. E.g. MpiParams=ports=12000-12999
  • For nodes to come back automatically after a reboot, '1' is not sufficient; ReturnToService=2 is required

Some places show AdminLevel={0,1,2}; it can only be set with
sacctmgr …user... set AdminLevel={none|operator|admin}
Anyone who is 'admin' can create reservations and partitions in sview's admin mode.
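
A minimal sketch of one complete form of that command, assuming the user is selected with "where name=". The helper name and the user name are hypothetical; the -i option makes sacctmgr commit without its interactive confirmation.

```shell
# Hypothetical helper: set a user's AdminLevel, accepting only the named levels.
set_admin_level() {
    user="$1"
    level="$2"
    case "$level" in
        none|operator|admin) ;;
        *) echo "invalid AdminLevel: $level" >&2; return 1 ;;
    esac
    # -i commits the change immediately instead of asking for confirmation.
    sacctmgr -i modify user where name="$user" set AdminLevel="$level"
}
```

Usage: `set_admin_level alice operator`.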

Partition (queue): on / pause / drain / off
scontrol update partitionname=NNN State={ UP | INACTIVE | DRAIN | DOWN }
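
The four states differ in whether new jobs are accepted and whether queued jobs are started. A small sketch with the semantics as comments (the helper name is hypothetical):

```shell
# Hypothetical helper: change a partition's state after validating it.
#   UP        new jobs are accepted and queued jobs are started
#   INACTIVE  no new jobs are accepted, queued jobs are not started
#   DRAIN     no new jobs are accepted, already queued jobs still run
#   DOWN      new jobs are accepted, but queued jobs are not started
set_partition_state() {
    part="$1"
    state="$2"
    case "$state" in
        UP|INACTIVE|DRAIN|DOWN) ;;
        *) echo "unknown partition state: $state" >&2; return 1 ;;
    esac
    scontrol update PartitionName="$part" State="$state"
}
```

Usage: `set_partition_state amd DRAIN` before maintenance, then `set_partition_state amd UP` afterwards.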

(From the Slurm FAQ) What process should I follow to add nodes to Slurm?

The slurmctld daemon has a multitude of bitmaps to track state of nodes and cores in the system. Adding nodes to a running system would require the slurmctld daemon to rebuild all of those bitmaps, which the developers feel would be safer to do by restarting the daemon. Communications from the slurmd daemons on the compute nodes to the slurmctld daemon include a configuration file checksum, so you probably also want to maintain a common slurm.conf file on all nodes. The following procedure is recommended:
  1. Stop the slurmctld daemon (e.g. "/etc/init.d/slurm stop" on the head node)
  2. Update the slurm.conf file on all nodes in the cluster
  3. Restart the slurmctld daemon (e.g. "/etc/init.d/slurm start" on the head node)
  4. Start the slurmd daemons on the new nodes (e.g. "/etc/init.d/slurm start" on those nodes)
  5. Have all slurmd daemons read the new configuration file (e.g. "scontrol reconfig", no need to restart the daemons)
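
The five steps above can be sketched as one script run on the head node. This is only an outline under assumptions: the slurm.conf path, distribution via scp, and passwordless ssh to the nodes are not part of the FAQ text, and the function name is hypothetical.

```shell
SLURM_INIT="${SLURM_INIT:-/etc/init.d/slurm}"     # init script named in the FAQ
CONF_PATH="${CONF_PATH:-/etc/slurm/slurm.conf}"   # assumed config location

# Hypothetical helper. $1: new nodes, $2: all nodes (space-separated lists).
add_nodes() {
    new_nodes="$1"
    all_nodes="$2"
    "$SLURM_INIT" stop || return 1                      # 1. stop slurmctld
    for n in $all_nodes; do                             # 2. update slurm.conf everywhere
        scp "$CONF_PATH" "$n:$CONF_PATH" || return 1
    done
    "$SLURM_INIT" start || return 1                     # 3. restart slurmctld
    for n in $new_nodes; do                             # 4. start slurmd on the new nodes
        ssh "$n" "$SLURM_INIT start" || return 1
    done
    scontrol reconfig                                   # 5. all daemons re-read the config
}
```

Usage: `add_nodes "node09 node10" "node01 node02 node09 node10"`, after the head node's slurm.conf already contains the new node definitions.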
