Using HPCs at UNC
When jobs crash without explanation:
- Check whether you are over the disk quota. The quota was 52GB in 2020.
- The cause may be an error writing to a file. This terminates the process without any information on the reason in the log.
- Matlab ignores file writing errors and keeps running. The crash only occurs later, when a file read fails.
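A quick usage check before submitting a long job can rule out the quota as the cause. The following is a minimal sketch, assuming the quota applies to a single directory tree and that plain `du` gives a good enough estimate; the cluster may offer its own quota tool. The function name `check_quota` and its arguments are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: report a directory's disk usage as a share of a quota.
# Usage: check_quota DIR QUOTA_GB [WARN_PERCENT]
# The parenthesized body runs in a subshell, so its variables stay local.
check_quota() (
    dir="$1"; quota_gb="$2"; warn_pct="${3:-90}"
    # du -sk prints the total size in kilobytes, followed by the path.
    used_kb=$(du -sk "$dir" 2>/dev/null | awk '{print $1}')
    quota_kb=$((quota_gb * 1024 * 1024))
    pct=$((used_kb * 100 / quota_kb))
    echo "Using ${pct}% of ${quota_gb} GB in ${dir}"
    # A non-zero exit status signals usage at or above the warning threshold.
    [ "$pct" -lt "$warn_pct" ]
)
```

For example, `check_quota "$HOME" 52 || echo "Close to the quota"` prints a warning once usage reaches 90% of 52GB.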
Set up login via ssh keys. Then use rsync to upload and download files.
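The key setup and transfers might look like the sketch below. The login `user@cluster.example.edu` is a placeholder, not a real address, and the helper names `upload` and `download` are hypothetical. Setting `RSYNC="echo rsync"` prints the commands instead of running them.

```shell
#!/bin/bash
# Assumptions: REMOTE is a placeholder login and all directories are examples.
REMOTE="${REMOTE:-user@cluster.example.edu}"
RSYNC="${RSYNC:-rsync}"

# One-time key setup, run by hand on the local machine:
#   ssh-keygen -t ed25519     # create a key pair
#   ssh-copy-id "$REMOTE"     # install the public key on the cluster

# Copy a local directory to the cluster. The trailing slashes make rsync
# copy the directory contents instead of nesting a new directory.
upload() {
    $RSYNC -avz "$1/" "$REMOTE:$2/"
}

# Copy a directory from the cluster back to the local machine.
download() {
    $RSYNC -avz "$REMOTE:$1/" "$2/"
}
```

With these definitions, `upload project/data project/data` sends the local data directory to the same relative path under the remote home directory, and `download project/out project/out` fetches results.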
Changing software versions
module avail julia lists the available versions.
module add julia/1.4.1 adds a module for the current session only. Run module save afterwards to make the change permanent.
module rm removes a module.
module list lists the currently loaded modules.
Each option line in the sbatch file looks like
#SBATCH -o value
where -o stands for an option flag. Common options:
- -t 03-00: time limit in days-hours
- -N 1: number of nodes
- --mem 24576: memory in megabytes (per node; use --mem-per-cpu for a per-cpu limit)
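Put together, a complete job file using only the options listed above might look like the sketch below. The file name `example_job.sh` and the script `my_script.jl` are hypothetical, and the Julia module is just the version mentioned earlier in these notes.

```shell
#!/bin/bash
# Write an example job file. The quoted 'EOF' stops the shell from
# expanding anything inside the here-document.
cat > example_job.sh <<'EOF'
#!/bin/bash
#SBATCH -t 03-00
#SBATCH -N 1
#SBATCH --mem 24576

# Load the software, then run the (hypothetical) program.
module add julia/1.4.1
julia my_script.jl
EOF

# Submit it on the cluster with:
#   sbatch example_job.sh
```

The #SBATCH lines must come before the first command in the job file; to the shell they are ordinary comments, but sbatch reads them as options.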
Status of running jobs:
- squeue -u <username>
- squeue --job XXXX
sacct --format="JobID,JobName%30,State,ExitCode" (best typed using Keyboard Maestro)
It is easiest to replicate the local directory structure on killdevil. Then running files on the cluster only requires changing the base directory that all other directories hang off.
Write your code so that the data are placed in the right folder (different on unix cluster vs local machine).
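One way to switch the base directory is to key it off the machine name. The sketch below shows the idea in shell; the same pattern carries over to Matlab. Both paths are hypothetical, and it is an assumption that the cluster's host names start with `killdevil`.

```shell
#!/bin/bash
# Hypothetical paths: only the base directory differs between machines.
# The optional argument overrides the detected host name (useful for testing).
base_dir() (
    host="${1:-$(uname -n)}"
    case "$host" in
        killdevil*) echo "/cluster/home/user/project" ;;  # assumed cluster path
        *)          echo "$HOME/project" ;;               # local machine
    esac
)

# All other directories hang off the base directory.
BASE=$(base_dir)
DATA_DIR="$BASE/data"
OUT_DIR="$BASE/out"
```

Because every other path is built from `BASE`, moving between the local machine and the cluster touches only the `case` statement.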
Write a Matlab function that does the computations and can be called from the command line.
Command syntax:
bsub -n 8 -R "span[hosts=1]" matlab -nodisplay -nosplash -r "run_batch_so1('fminsearch',1,1,7)" -logfile set7.out
- -n 8: requests 8 cores, the maximum Matlab can handle
- -R "span[hosts=1]": places all requested cores on the same host
run_batch_so1 is my command file in this example.
Kure jobs crash regularly. It is important to make sure the optimization algorithm can hot-start (resume after a crash using a saved history).
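The hot-start pattern itself is small. The sketch below shows it in shell with a counter standing in for the optimization state; in Matlab the saved state would be the minimal optimization history instead. The function name and the checkpoint format are hypothetical.

```shell
#!/bin/bash
# Hypothetical checkpoint loop: the "optimization" is a stand-in counter.
# Usage: run_with_checkpoint STATE_FILE MAX_ITER
run_with_checkpoint() (
    state_file="$1"; max_iter="$2"
    iter=0
    # Hot-start: resume from the saved iteration if a checkpoint exists.
    if [ -f "$state_file" ]; then
        iter=$(cat "$state_file")
    fi
    while [ "$iter" -lt "$max_iter" ]; do
        iter=$((iter + 1))
        # ... one step of the real computation would go here ...
        # Write to a temporary file, then rename it, so a crash in the
        # middle of a write never leaves a truncated checkpoint behind.
        echo "$iter" > "$state_file.tmp"
        mv "$state_file.tmp" "$state_file"
    done
    echo "finished at iteration $iter"
)
```

Resubmitting the same job after a crash then continues from the last completed iteration instead of starting over.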
- Matlab cannot save large / complex files. It crashes. Only save the minimum optimization history needed to hot-start.
- When using parallel algorithms, make sure different instances do not try to read / write the same file at the same time. A simple semaphore approach is easily implemented (each instance locks the file it wants to access by writing a file, e.g. "lock07_param.mat" indicates that instance 7 has locked the file param.mat).
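The locking idea can be sketched in shell as follows. This is a variation on the naming scheme above, not the same scheme: it uses one lock per file, created with `mkdir` as a directory named `<file>.lock`, and records the instance number inside it. `mkdir` either creates the directory or fails, so two instances cannot both acquire the lock, which a check-then-write sequence does not guarantee. The helper names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers: lock FILE by creating the directory "FILE.lock".

# Usage: acquire_lock FILE INSTANCE [MAX_ATTEMPTS]
# Retries once per second and returns 1 if the lock stays taken.
acquire_lock() {
    _tries=0
    until mkdir "$1.lock" 2>/dev/null; do
        _tries=$((_tries + 1))
        [ "$_tries" -ge "${3:-30}" ] && return 1
        sleep 1
    done
    # Record which instance holds the lock.
    echo "$2" > "$1.lock/owner"
}

# Usage: release_lock FILE
release_lock() {
    rm -rf "$1.lock"
}
```

An instance would call `acquire_lock param.mat 7`, read or write `param.mat`, and then call `release_lock param.mat`. A crashed instance leaves its lock behind, so stale locks need to be removed before restarting.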