High-Performance Computing Setup – Part I – Network File System and Module Management Configuration
Install nfs-kernel-server on your head node and nfs-common on your compute nodes.
On the head node:
apt install nfs-kernel-server
On the compute nodes:
apt install nfs-common
Configure the shared directories on your head node by modifying /etc/exports.
/opt 192.168.1.0/24(ro,sync,no_subtree_check)
/home 192.168.1.0/24(rw,sync,no_subtree_check)
/scratch 192.168.1.0/24(rw,sync,no_subtree_check)
Run the following command to make the settings take effect.
/sbin/exportfs -ra
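Before mounting, it is worth confirming from a compute node that the exports are actually visible; a quick check, assuming the head node resolves as j35a:

```shell
# On a compute node: list the directories exported by the head node.
showmount -e j35a    # or: showmount -e 192.168.1.1
```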
Mount the shared directories on the compute nodes.
mount j35a:/opt /opt
mount j35a:/home /home
mount j35a:/scratch /scratch
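A quick sanity check that the mounts are live and that the rw exports are actually writable:

```shell
# List active NFS mounts; /opt, /home and /scratch should appear.
df -h -t nfs -t nfs4
# Verify write access through the rw export.
touch /scratch/.nfs_test && rm /scratch/.nfs_test
```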
Set up auto-mounting by editing /etc/fstab.
On the head node we need to auto-mount the external hard drives to the scratch folders. The UUID of an external hard drive can be found with lsblk -f /dev/sda1.
UUID=fd2391a4-72ca-4926-bd1c-8bd0d44a4448 /scratch ext4 defaults 0 2
UUID=2cf23a6c-2524-4fa1-963e-4d3465b99008 /scratch2 ext4 defaults 0 2
On the compute nodes we need to auto-mount the shared directories (/opt, /home and /scratch) from the head node (IP 192.168.1.1).
192.168.1.1:/opt /opt nfs defaults,_netdev 0 0
192.168.1.1:/home /home nfs defaults,_netdev 0 0
192.168.1.1:/scratch /scratch nfs defaults,_netdev 0 0
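On any node, the new fstab entries can be validated without a reboot; a malformed line shows up immediately:

```shell
# Mount everything listed in /etc/fstab that is not yet mounted.
sudo mount -a
# Confirm what is now backing /scratch.
findmnt /scratch
```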
Add each user with the same user ID and group ID. This has to be done on every node (head node and compute nodes).
/sbin/adduser --uid 1002 chenxi
/sbin/adduser --uid 1003 nam
/sbin/adduser --uid 1004 jianghong
/sbin/adduser --uid 1005 guanming
/sbin/adduser --uid 1006 chengxi
/sbin/adduser --uid 1007 ningwen
/sbin/adduser --uid 1008 junbo
/sbin/adduser --uid 1009 haoyang
/sbin/adduser --uid 1010 choi
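The calls above can also be scripted so that every node gets identical UID/name pairs; a sketch, assuming Debian's adduser and its --disabled-password/--gecos flags to suppress the interactive prompts:

```shell
# Run as root on the head node and on every compute node.
# Each entry is uid:username; identical UIDs keep NFS ownership consistent.
for entry in 1002:chenxi 1003:nam 1004:jianghong 1005:guanming \
             1006:chengxi 1007:ningwen 1008:junbo 1009:haoyang 1010:choi; do
    uid=${entry%%:*}
    name=${entry#*:}
    adduser --uid "$uid" --disabled-password --gecos "" "$name"
done
```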
Create a directory under the shared /scratch for every user, give them ownership, and link it into their home directory. This only needs to be done on the head node. Example:
cd /scratch
mkdir chenxi
chown chenxi:chenxi ./chenxi
ln -s /scratch/chenxi /home/chenxi/scratch
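The same example, applied to every user in one pass (head node only); this assumes the home directories under /home already exist:

```shell
# One scratch directory per user, owned by that user, linked into
# the user's home directory. Run as root on the head node.
for name in chenxi nam jianghong guanming chengxi ningwen junbo haoyang choi; do
    mkdir -p "/scratch/$name"
    chown "$name:$name" "/scratch/$name"
    ln -sfn "/scratch/$name" "/home/$name/scratch"
done
```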
Install Slurm using the following command on both the head node and the compute nodes.
sudo apt-get install slurm-wlm
Check the Slurm service status. On the head node both slurmctld and slurmd need to run; on the compute nodes only slurmd is needed.
sudo systemctl status slurmctld
sudo systemctl status slurmd
If a service is not running, start it with:
sudo systemctl start slurmctld
sudo systemctl start slurmd
To have Slurm start automatically after a reboot:
sudo systemctl enable slurmctld
sudo systemctl enable slurmd
Edit /etc/slurm/slurm.conf as follows:
ClusterName=LiCPU
ControlMachine=j35a
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/log/slurm/slurmctld
SlurmdSpoolDir=/var/log/slurm/slurmd
SlurmctldPidFile=/var/log/slurm/slurmctld.pid
SlurmdPidFile=/var/log/slurm/slurmd.pid
ReturnToService=2
ProctrackType=proctrack/linuxproc
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
GresTypes=cpu
TaskPlugin=task/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
NodeName=j35a NodeAddr=192.168.1.1 Sockets=2 CoresPerSocket=42 ThreadsPerCore=2 RealMemory=500000 State=UNKNOWN
NodeName=j35b NodeAddr=192.168.1.2 Sockets=2 CoresPerSocket=48 ThreadsPerCore=2 RealMemory=500000 State=UNKNOWN
PartitionName=Normal Nodes=ALL Default=YES MaxTime=INFINITE State=UP OverSubscribe=NO
Make sure all the files and directories mentioned in slurm.conf are writable by the user 'slurm'. The slurm user is created automatically when Slurm is installed.
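A sketch of preparing the locations named in slurm.conf:

```shell
# Create the state/spool/log directories from slurm.conf and hand them
# to the slurm user. Run on every node; the slurmctld paths are only
# used on the head node.
sudo mkdir -p /var/log/slurm/slurmctld /var/log/slurm/slurmd
sudo chown -R slurm:slurm /var/log/slurm
```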
Create or edit /etc/slurm/cgroup.conf as follows:
CgroupAutomount=no
ConstrainCores=yes
ConstrainRAMSpace=yes
##TaskAffinity=yes
ConstrainDevices=no
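With both configuration files in place, restarting the daemons and running a trivial job confirms the cluster is usable:

```shell
sudo systemctl restart slurmctld   # head node only
sudo systemctl restart slurmd      # every node
sinfo                              # both nodes should report state "idle"
srun -N2 hostname                  # should print j35a and j35b
```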
Install environment-modules on your head node, then make the module command and the shared modulefile path available at login by appending to /etc/profile:
sudo apt-get install environment-modules
echo "source /usr/share/modules/init/bash" >> /etc/profile
echo 'export MODULEPATH=$MODULEPATH:/opt/module_file/' >> /etc/profile
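MODULEPATH points the module command at modulefiles kept under /opt/module_file/. As an illustration, a minimal Tcl modulefile for a hypothetical package installed under /opt/mytool/1.0 could be created like this (all names here are placeholders, not part of the original setup):

```shell
# Run as root on the head node; /opt is NFS-shared, so compute nodes see it too.
mkdir -p /opt/module_file/mytool
cat > /opt/module_file/mytool/1.0 <<'EOF'
#%Module1.0
proc ModulesHelp { } { puts stderr "mytool 1.0 (example package)" }
set root /opt/mytool/1.0
prepend-path PATH            $root/bin
prepend-path LD_LIBRARY_PATH $root/lib
EOF
```

In a new shell, `module avail` should then list mytool/1.0, and `module load mytool/1.0` prepends its bin and lib directories to the environment.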