Remote Clusters¶
See also: The Ultimate Guide to Distributed Computing with a MWE.
Preparation¶
- Get an account on the cluster, such as UNC's `longleaf`.
- Generate an `ssh` key, which allows you to log on without entering passwords.
- Try this out by logging into the cluster via `ssh`. At the terminal, enter `ssh user@longleaf.unc.edu`.
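The preparation steps can be sketched as a pair of shell helpers. This is only a sketch under assumptions: the `ed25519` key type, the key location, and the helper names `make_key` and `install_key` are illustrative, `user` is a placeholder login, and some clusters may not accept `ssh-copy-id`.

```shell
# Placeholders: replace "user" with your own cluster login.
cluster_user="user"
cluster_host="longleaf.unc.edu"
key_file="$HOME/.ssh/id_ed25519"   # assumed key location and type

# Create a key pair once; ssh-keygen prompts for an optional passphrase.
make_key() {
    [ -f "$key_file" ] || ssh-keygen -t ed25519 -f "$key_file"
}

# Copy the public key to the cluster (asks for the password one last time).
install_key() {
    ssh-copy-id -i "$key_file.pub" "$cluster_user@$cluster_host"
}

echo "Next: make_key && install_key && ssh $cluster_user@$cluster_host"
```

Nothing is generated or copied until the two helpers are called. If the final `ssh` command still asks for a password, the key was probably not installed on the cluster.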
Installing Julia on a Cluster¶
This is for cases where one does not want to use the version that is installed for everyone (usually because it lags behind the current version).
The command line installation instructions for Linux produce a directory of the format `julia-1.6.1` with a binary at `bin/julia`.
All it then takes is to replace the generic `julia -e` with `~/julia-1.6.1/bin/julia -e` in the command files called from `slurm`.
Better: Install juliaup and let it handle julia version installations.
Getting started with a test script¶
How to get your code to run on a typical Linux cluster?
- Get started by writing a simple test script (`Test3.jl`) so we can test running from the command line.
- Make sure you can run the test script locally with `julia "/full/path/to/Test3.jl"`.
- Now copy `Test3.jl` to a directory on the cluster and repeat the same.
- Once: make Julia available on the cluster with `module add julia`, or `module add julia/1.5.3` if you want a specific version.
- Then run `julia "/full/path/to/Test3.jl"`.
Now we know that things run on the cluster and it's time to submit a batch file:
sbatch -p general -N 1 -J "test_job" -t 3-00 --mem 16384 -n 1 --mail-type=end --mail-user=lhendri@email.unc.edu -o "test1.out" --wrap="julia /full/path/to/Test3.jl"
Slurm¶
Submitting jobs¶
The usual way of submitting jobs consists of writing an sbatch file and then submitting it using the `sbatch` command.
Steps:
- Copy your code and all of its dependencies to the cluster (see below). This is not needed when all dependencies are registered.
- Write a Julia script that contains the startup code for the project and then runs the actual computation (call this `batch.jl`).
- Write a batch file that submits `julia batch.jl` as a job to the cluster's job scheduler. For UNC's longleaf cluster, this would be slurm. So you need to write `job.sl`, which will be submitted using `sbatch job.sl`.
Each line in the sbatch file looks like `#SBATCH -o value`, where `-o` stands for any option.
Options include:
* `-t 03-00`: time limit in days-hours
* `-N 1`: number of nodes
* `--mem 24576`: memory in megabytes (per node; use `--mem-per-cpu` for a per-CPU limit)
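As a sketch of what such a file could look like, the snippet below writes a minimal `job.sl`. The option values are taken from the examples in this section; the partition name `general`, the job name, and the output file are carried over from the one-line `sbatch` command above and may need adjusting for another cluster.

```shell
# Write a minimal sbatch file to the current directory.
job_file="job.sl"

cat > "$job_file" <<'EOF'
#!/bin/bash
#SBATCH -p general
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 03-00
#SBATCH --mem 24576
#SBATCH -J test_job
#SBATCH -o test1.out

julia /full/path/to/batch.jl
EOF

echo "Wrote $job_file; submit it on the cluster with: sbatch $job_file"
```

All `#SBATCH` lines have to come before the first command in the file; Slurm ignores directives that appear after it.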
Status of running jobs:¶
- `squeue -u <username>`
- `squeue --jobs XXXX`
- `sacct --format="JobID,JobName%30,State,ExitCode"` (best typed using KeyboardMaestro)
Status of completed jobs¶
Jobs can be retrieved by start time if they no longer show up in the job list (via `squeue` or `sacct`). Example: `sacct --starttime=2023-07-21`.
Examining memory and cpu usage¶
After jobs have completed:
- `seff jobid`
- `sacct` with the `MaxRSS` field (for example `sacct --format="JobID,MaxRSS"`) shows memory usage.
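The two commands can be combined into one small helper. This is a sketch only: the function name `job_report` is made up for illustration, and it assumes that `seff` and `sacct` are on the `PATH`, which is normally true only on the cluster itself.

```shell
# Hypothetical helper: summarize CPU and memory usage of a finished job.
job_report() {
    if [ -z "$1" ]; then
        echo "usage: job_report JOBID" >&2
        return 1
    fi
    # Efficiency summary (CPU and memory utilization).
    seff "$1"
    # Per-step accounting, including peak resident memory (MaxRSS).
    sacct -j "$1" --format="JobID,JobName%30,State,ExitCode,Elapsed,MaxRSS"
}
```

On the cluster this would be called as `job_report XXXX`, where `XXXX` is the job ID printed by `sbatch`.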
[July 2023] Sometimes, jobs run out of memory on the cluster, even though they run fine on a local machine. This problem appears to occur when garbage collection does not run as needed. One solution: trigger garbage collection manually when memory is low with
if Sys.free_memory() / 2^30 < 6.0  # gigabytes
    GC.gc()
end
This may trigger garbage collection all the time (at least on my local Mac, `free_memory` is always close to zero). And it does not help in my case.
The command line option `--heap-size-hint=15G` is supposed to trigger garbage collection when about 15GB of memory are allocated. That seems to help.
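One way to pass the option in a batch submission is sketched below. It reuses the values from the one-line `sbatch` example earlier on this page and assumes a Julia version that supports `--heap-size-hint` (1.9 or later, to my knowledge). Keeping the hint somewhat below the memory requested from Slurm is my own assumption, not a documented rule.

```shell
mem_mb=16384          # memory requested from Slurm via --mem, in megabytes
heap_hint="15G"       # assumed: slightly below the Slurm limit
script="/full/path/to/Test3.jl"

# The command that the job will run on the compute node.
wrap_cmd="julia --heap-size-hint=$heap_hint $script"

# Print the submission command instead of running it.
echo "sbatch -p general -N 1 -n 1 -t 3-00 --mem $mem_mb --wrap=\"$wrap_cmd\""
```

The script only prints the command, so it can be inspected before it is pasted into a terminal on the cluster.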
Errors¶
From time to time, GitHub asks for user credentials when trying to download private repos, even if those have been downloaded many times before. Then precompile the package from the REPL on the cluster, entering the credentials by hand. They will then be stored for some time again.
Enter the personal access token instead of the account password. Sometimes (!) this must be typed into the terminal. Pasting may not work (why not?). Unfortunately, one cannot see what (if anything) has been pasted or typed. A truly moronic design.
The Julia script¶
Submitting a job is (almost) equivalent to running `julia batch.jl` from the terminal.
Note: cd() does not work in these command files. To include a file, provide a full path.
If you only use registered packages, life is easy. Your code would simply say:
using Pkg
# This needs to be done only once, but it does not hurt
Pkg.add("MyPackage")
# Make sure all required packages are downloaded
Pkg.instantiate()
# Load the package before calling into it
using MyPackage
MyPackage.run()
If the code for MyPackage has been copied to the remote, then
julia --project="/path/to/MyPackage" --startup-file=no batch.jl
activates MyPackage and runs `batch.jl`. The `--project` option is equivalent to `Pkg.activate`.
- Julia looks for `batch.jl` in the `MyPackage` directory.
- Disabling the startup-file prevents surprises where the startup-file changes the directory before looking for `batch.jl`.
- `~` is not expanded when relative paths are used.
If MyPackage is unregistered or contains unregistered dependencies, things get more difficult. Now `batch.jl` must:
- Activate the package's environment.
- develop all unregistered dependencies. This replaces the invalid paths to directories on the local machine (e.g. /Users/lutz/julia/...) with the corresponding paths on the cluster (e.g. /nas/longleaf/...). Note: I verified that one cannot replace homedir() with ~ in Manifest.toml.
- using MyPackage
- MyPackage.run()
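The four steps above can be sketched as a generated `batch.jl`. The shell snippet below writes such a file; the Julia code inside it uses only `Pkg.activate`, `Pkg.develop(path=...)` and `Pkg.instantiate`. Both paths are placeholders, and `UnregisteredDep` is a hypothetical dependency name; one `Pkg.develop` line is needed per unregistered dependency.

```shell
# Write a minimal batch.jl to the current directory.
batch_file="batch.jl"

cat > "$batch_file" <<'EOF'
using Pkg
# 1. Activate the package's environment (path on the cluster).
Pkg.activate("/path/to/MyPackage")
# 2. Point each unregistered dependency to its location on the cluster.
#    "/path/to/UnregisteredDep" is a placeholder.
Pkg.develop(path="/path/to/UnregisteredDep")
# Download any registered dependencies that are still missing.
Pkg.instantiate()
# 3. and 4. Load the package and start the computation.
using MyPackage
MyPackage.run()
EOF

echo "Wrote $batch_file; run it with: julia --startup-file=no $batch_file"
```

Because `Pkg.develop` rewrites the paths recorded in `Manifest.toml`, the manifest on the cluster will differ from the local one after the first run.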
Developing MyPackage in a blank folder does not work (for reasons I do not understand). It results in errors indicating that dependencies of MyPackage could not be found.
This approach requires you to keep track of all unregistered dependencies and where they are located on the remote machine. My way of doing this is contained in `PackageTools.jl` in the shared repo (this is not a package because its very purpose is to facilitate loading of unregistered packages). But the easier way is to create a private registry and register all dependencies.
File Transfer¶
A reliable command line transfer option is `rsync` (on Mac / Linux). The command would be something like
rsync -atuzv "/someDirectory/sourceDir/" "username@longleaf.unc.edu:someDirectorySourceDir"
Notes:
- The source dir should end in "/"; the target dir should not.
- Excluding `.git` speeds up the transfer.
- `--delete` ensures that no old files remain on the server.
- This will use `ssh` for authentication if it is set up.
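Putting these notes together, a transfer command could be assembled as follows. This is a sketch that reuses the placeholder paths from the example above; the exact option string is my assumption about how the notes combine, and the script only prints the command rather than running it.

```shell
src="/someDirectory/sourceDir"
dest="username@longleaf.unc.edu:someDirectorySourceDir"

# Ensure exactly one trailing slash on the source and none on the target.
src="${src%/}/"
dest="${dest%/}"

# -atuzv as in the example above; --delete removes stale files on the
# server; --exclude=.git skips the repository history.
rsync_opts="-atuzv --delete --exclude=.git"

echo "rsync $rsync_opts \"$src\" \"$dest\""
```

Since `--delete` removes files on the target, it can be worth adding `--dry-run` the first time to preview what would change.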
An alternative is to use `git`.
To transfer an individual file from within Julia: run(`scp $filename hostname:/path/to/newfile.txt`).