Remote Clusters

See also: The Ultimate Guide to Distributed Computing with a MWE.

Preparation

Get an account on the cluster, such as UNC's longleaf.
Generate an ssh key, which allows you to log on without entering passwords.
Try this out by logging into the cluster via ssh. At the terminal, enter ssh user@longleaf.unc.edu.

Installing Julia on a Cluster

This is for cases where one does not want to use the version that is installed for everyone (usually because it lags behind the current version).

The command line installation instructions for Linux produce a directory of the format julia-1.6.1 with a binary of bin/julia.

All it then takes is to replace the generic julia -e with ~/julia-1.6.1/bin/julia -e in the command files called from Slurm.

Better: Install juliaup and let it handle installing Julia versions.

Getting started with a test script

How do you get your code to run on a typical Linux cluster?

Get started by writing a simple test script (Test3.jl) so we can test running from the command line.
Make sure you can run the test script locally with julia "/full/path/to/Test3.jl"
Now copy Test3.jl to a directory on the cluster and repeat the same.
Once: make Julia available on the cluster with module add julia or module add julia/1.5.3 if you want a specific version.
Then run julia "/full/path/to/Test3.jl"
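A minimal sketch of the steps above. The contents of Test3.jl are not shown in this guide, so the two-line script below is an assumption; any script that prints something will do.

```shell
# Hypothetical contents for Test3.jl (the original script is not shown here).
cat > Test3.jl <<'EOF'
println("Hello from ", gethostname())
println("Julia version: ", VERSION)
EOF

# Run the script with a full path, as in the steps above.
if command -v julia >/dev/null 2>&1; then
    julia "$PWD/Test3.jl"
else
    echo "julia not found; on the cluster, try 'module add julia' first"
fi
```

Running the same commands locally and on the cluster makes it easy to see whether a problem comes from the script or from the cluster setup.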

Now we know that things run on the cluster and it's time to submit a batch file:

sbatch -p general -N 1 -J "test_job" -t 3-00 --mem 16384 -n 1 --mail-type=end --mail-user=lhendri@email.unc.edu -o "test1.out" --wrap="julia /full/path/to/Test3.jl"

Slurm

Submitting jobs

The usual way of submitting jobs consists of writing an sbatch file and then submitting it using the sbatch command.

Steps:

Copy your code and all of its dependencies to the cluster (see below). This is not needed when all dependencies are registered.
Write a Julia script that contains the startup code for the project and then runs the actual computation (call this batch.jl).
Write a batch file that submits julia batch.jl as a job to the cluster's job scheduler. On UNC's longleaf cluster the scheduler is Slurm, so you write a file job.sl and submit it with sbatch job.sl.

Each option line in the sbatch file has the form #SBATCH -o value, where -o stands for any option flag.

Options include:
* -t 03-00: time limit in days-hours
* -N 1: number of nodes
* --mem 24576: memory in megabytes (per node; use --mem-per-cpu for a per-CPU limit)
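As a sketch, the following writes a job.sl that mirrors the one-line sbatch command shown earlier. The partition, job name, output file, and script path are taken from that example and are placeholders; adjust them for your own project.

```shell
# Write a minimal sbatch file. All values are placeholders from the earlier example.
cat > job.sl <<'EOF'
#!/bin/bash
#SBATCH -p general
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 03-00
#SBATCH --mem 24576
#SBATCH -J test_job
#SBATCH -o test1.out
#SBATCH --mail-type=end

julia /full/path/to/batch.jl
EOF

# Submit on the cluster with:
#   sbatch job.sl
echo "Wrote job.sl with $(grep -c '^#SBATCH' job.sl) option lines"
```

Keeping the options in a file rather than on the command line makes it easier to rerun a job with the same settings.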

Status of running jobs

squeue -u $USER
squeue --job XXXX
sacct --format="JobID,JobName%30,State,ExitCode" (best typed using KeyboardMaestro)

Status of completed jobs

Jobs that no longer show up in the job list (via squeue or sacct) can be retrieved by start time. Example: sacct --starttime=2023-07-21.

Examining memory and cpu usage

Yale guide

After jobs have completed:
- seff jobid shows CPU and memory efficiency.
- sacct with the MaxRSS field (--format=JobID,MaxRSS) shows memory usage.
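sacct typically reports MaxRSS with a unit suffix (for example 16777216K), which is hard to compare with the --mem request. The helper below is a small sketch, not part of Slurm; the function name rss_to_gb is made up for illustration, and it assumes the suffix is K, M, or G (or absent, meaning bytes).

```shell
# Convert a MaxRSS value such as "16777216K" to gigabytes.
# Hypothetical helper; assumes a K, M, or G suffix, or plain bytes.
rss_to_gb() {
    echo "$1" | awk '{
        value = $1
        unit = substr(value, length(value), 1)
        if (unit ~ /[KMG]/) {
            value = substr(value, 1, length(value) - 1)
        } else {
            unit = "B"
        }
        if (unit == "K") gb = value / 1024 / 1024
        else if (unit == "M") gb = value / 1024
        else if (unit == "G") gb = value
        else gb = value / 1024 / 1024 / 1024
        printf "%.2f\n", gb
    }'
}

# Possible use on the cluster (12345 is a placeholder job id):
#   sacct -j 12345 -n -P --format=MaxRSS | while read -r rss; do rss_to_gb "$rss"; done
rss_to_gb 16777216K
```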

[July 2023] Sometimes, jobs run out of memory on the cluster, even though they run fine on a local machine. This problem appears to occur when garbage collection does not run as needed. One solution: trigger garbage collection manually when memory is low with

if Sys.free_memory()/2^30 < 6.0 # gigabytes
    GC.gc()
end

This may trigger garbage collection all the time (at least on my local Mac, free_memory is always close to zero). And it does not help in my case.

The command line option --heap-size-hint=15G (available since Julia 1.9) is supposed to trigger garbage collection when about 15GB of memory are allocated. That seems to help.

Errors

From time to time, GitHub asks for user credentials when trying to download private repos, even if those have been downloaded many times before. When that happens, precompile the package from the REPL on the cluster, entering the credentials by hand. They will then be stored for some time again.

Enter the personal access token instead of the account password. Sometimes (!) this must be typed into the terminal. Pasting may not work (why not?). Unfortunately, one cannot see what (if anything) has been pasted or typed. A truly moronic design.

The Julia script

Submitting a job is (almost) equivalent to julia batch.jl from the terminal.

Note: cd() does not work in these command files. To include a file, provide a full path.

If you only use registered packages, life is easy. Your code would simply say:

using Pkg
# This needs to be done only once, but repeating it does no harm
Pkg.add("MyPackage")
# Make sure all required packages are downloaded
Pkg.instantiate()
using MyPackage
MyPackage.run()

If the code for MyPackage has been copied to the remote, then

julia --project="/path/to/MyPackage" --startup-file=no batch.jl activates MyPackage and runs batch.jl. The --project option is equivalent to Pkg.activate.

Julia looks for batch.jl relative to the current working directory, so start the job from the MyPackage directory or give a full path to batch.jl.
Disabling the startup-file prevents surprises where the startup-file changes the directory before looking for batch.jl.
~ is not expanded when relative paths are used.

If MyPackage is unregistered or contains unregistered dependencies, things get more difficult. Now batch.jl must:

Activate the package's environment.
Develop all unregistered dependencies (with Pkg.develop). This replaces the invalid paths to directories on the local machine (e.g. /Users/lutz/julia/...) with the corresponding paths on the cluster (e.g. /nas/longleaf/...). Note: I verified that one cannot replace homedir() with ~ in Manifest.toml.
using MyPackage
MyPackage.run()
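The four steps above could look roughly like the batch.jl written by the following sketch. It is written as a shell heredoc so it can be pasted into a terminal on the cluster. All paths and the dependency name MyDependency are hypothetical; substitute the locations where the code actually lives on the remote.

```shell
# Write a batch.jl that activates the package and develops its unregistered
# dependencies. Paths and the name MyDependency are placeholders.
cat > batch.jl <<'EOF'
using Pkg
# 1. Activate the package's environment
Pkg.activate("/path/on/cluster/MyPackage")
# 2. Point each unregistered dependency to its location on the cluster
Pkg.develop(path = "/path/on/cluster/MyDependency")
# Download any registered dependencies that are still missing
Pkg.instantiate()
# 3. and 4. Load the package and run the computation
using MyPackage
MyPackage.run()
EOF

echo "Wrote batch.jl; run it with: julia --startup-file=no \"$PWD/batch.jl\""
```

With several unregistered dependencies, each one needs its own Pkg.develop line, which is the bookkeeping burden described below.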

Developing MyPackage in a blank folder does not work (for reasons I do not understand). It results in errors indicating that dependencies of MyPackage could not be found.

This approach requires you to keep track of all unregistered dependencies and where they are located on the remote machine. My way of doing this is contained in PackageTools.jl in the shared repo (this is not a package b/c its very purpose is to facilitate loading of unregistered packages). But the easier way is to create a private registry and register all dependencies.

File Transfer

A reliable command line transfer option is rsync (on Mac / Linux). The command would be something like

rsync -atuzv "/someDirectory/sourceDir/" "username@longleaf.unc.edu:someDirectorySourceDir"

Notes:

The source dir should end in "/"; the target dir should not.
Excluding .git (with --exclude ".git") speeds up the transfer.
--delete ensures that no old files remain on the server.
This will use ssh for authentication if it is set up.

An alternative is to use git.

To transfer an individual file from within Julia: run(`scp $filename hostname:/path/to/newfile.txt`).