Frequently asked questions

Connecting through SSH

How do I connect to the cluster?

Please see the welcome email that was sent to you when your account was created.

I forgot my password - what now?

Please contact the support team.

How do I change my password on Star?

You can run the passwd command on the login node to change your password. Please note passwd will have no affect from the compute nodes.

What is the ssh key fingerprint for the Star cluster?

The SHA256 key fingerprint is: SHA256:W0NKVfQBl5FeOlOkoEIKIVsp1+47yIvzJAYMx6ECpwM

If you are more of a visual person, run ssh -p 5010 -o VisualHostKey=yes [login node].star.hofstra.edu and compare it to this visual key:

+--[ED25519 256]--+
|E + ++  . .o=.+=.|
|o=.=o..o . . +ooo|
|* +o .. o o  .o+ |
|+o    .. +    =  |
|..   .  S o    o |
| .    .  o .     |
|  o... ..        |
| ..+o o          |
|  .oo. .         |
+----[SHA256]-----+
  

How do I access the compute nodes?

To access the compute nodes, you need to run sbatch or srun on a login node to submit a job.

If you are looking for the same experience as with ssh, you can use srun to launch a remote interactive shell. For more information and an example, please see Interactive jobs.

How can I run a GUI-based application on the cluster?

To run software with a graphical user interface (GUI), you need to have an X server installed on your computer (such as Cygwin-X on Windows or XQuartz on Mac) and need to connect to the cluster with display forwarding.

First login to Star with display forwarding. For example:
```
$ ssh -Y [login node].star.hofstra.edu
```
Then you should reserve resources with display forwarding through the queuing system. E.g.:
```
$ srun -N 1 -t 1:0:0 --pty bash -I     # reserve and log in on a compute node
```

Passwords

I forgot my password - what now?

Please contact the support team.

How do I change my password on Star?

You can run the passwd command on the login node to change your password. Please note passwd will have no affect from the compute nodes.

Working with jobs

What are jobs?

A job is a unit of work that is submitted to a scheduler or workload manager for execution. A job consists of a shell script or executable, the job's resource requirements specification (e.g. number of processors, memory, time limit), input data, specification of where output should be written to, and environment configuration. Jobs are submitted through a scheduler, which queues them and runs them on the compute nodes. Please read more at the Job Overview page.

How do I submit a job?

Jobs are submitted by running the sbatch or srun commands. You would create a job script to load any needed modules and setup the environment needed for your program.

Where can I find example job scripts?

You can find examples at Submitting jobs or on the cluster at /fs1/shared/docs/examples/.

When will my job start?

To find out approximately when the job scheduler thinks your job will start, use the command:

squeue --start -j <job_id>

This command will give you information about how many CPUs your job requires, for how long, as well as when approximately it will start and complete. It must be emphasized that this is just a best guess, queued jobs may start earlier because of running jobs that finishes before they hit the walltime limit and jobs may start later than projected because new jobs are submitted that get higher priority.

How can I see the queing situation?

In the command line, see the job queue by using squeue.

For more commands, please see Monitoring jobs.

Why does my job not start or give me error feedback when submitting?

Most often the reason a job is not starting is that Star is full at the moment and there are many jobs waiting in the queue. But sometimes there is an error in the job script and you are asking for a configuration that is not possible on Star. In such a case the job will not start.

To find out how to monitor your jobs and check their status see Monitoring jobs.

Below are a few cases of why jobs don't start or error messages you might get:

Memory per core

"When I try to start a job with 2GB of memory pr. core, I get the following error: sbatch: error: Batch job submission failed: Requested node configuration is not available With 1GB/core it works fine. What might be the cause to this?"

On Star we have two different configurations available; 16 core and 20 core nodes - with both a total of 32 GB of memory/node. If you ask for full nodes by specifying both number of nodes and cores/node together with 2 GB of memory/core, you will ask for 20 cores/node and 40 GB of memory. This configuration does not exist on Star. If you ask for 16 cores, still with 2GB/core, there is a sort of buffer within Slurm no allowing you to consume absolutely all memory available (system needs some to work). 2000MB/core works fine, but not 2 GB for 16 cores/node.

The solution we want to push in general is this:

#SBATCH -ntasks=80 # (number of nodes * number of cores, i.e. 5*16 or 4*20 = 80)
  

If you then ask for 2000MB of memory/core, you will be given 16 cores/node and a total of 16 nodes. 4000MB will give you 8 cores/node - everyone being happy. Just note the info about PE accounting; mem-per-cpu 4000MB will cost you twice as much as mem-per-cpu 2000MB.

Please also note that if you want to use the whole memory on a node, do not ask for 32GB, but for 31GB or 31000MB as the node needs some memory for the system itself.

Step memory limit

"Why do I get slurmstepd: Exceeded step memory limit in my log/output?"

For slurm, the memory flag is a hard limit, meaning that when each core tries to utilize more than the given amount of memory, it is killed by the slurm-deamon. For example $SBATCH --mem-per-cpu=2GB means that you maximum can use 2 GB of memory per core. With memory intensive applications like Comsol or VASP, your job will likely be terminated. The solution to this problem is to specify the number of tasks irrespectively of cores/node and ask for as much memory you will need.

For instance:

#SBATCH --ntasks=20
#SBATCH --time=0-24:05:00
#SBATCH --mem-per-cpu=6000MB
  

QOSMaxWallDurationPerJobLimit

QOSMaxWallDurationPerJobLimit means that MaxWallDurationPerJobLimit has been exceeded. Basically, you have asked for more time than allowed for the given QOS/Partition.

Priority vs. Resources

Priority means that resources are in principle available, but someone else has higher priority in the queue. Resources means the at the moment the requested resources are not available.

How can I run many short tasks?

The overhead in the job start and cleanup makes it unpractical to run thousands of short tasks as individual jobs on Star.

The queueing setup on Star, or rather, the accounting system generates overhead in the start and finish of a job of about 1 second at each end of the job. This overhead is insignificant when running large parallel jobs, but creates scaling issues when running a massive amount of shorter jobs. One can consider a collection of independent tasks as one large parallel job and the aforementioned overhead becomes the serial or unparallelizable part of the job. This is because the queuing system can only start and account one job at a time. This scaling problem is described by Amdahls Law.

If the tasks are extremly short (e.g. less than 1 second), you can use the example below.

If you want to spawn many jobs without polluting the queueing system, please have a look at array jobs.

By using some shell trickery one can spawn and load-balance multiple independent task running in parallel within one node, just background the tasks and poll to see when some task is finished until you spawn the next:

#!/usr/bin/env bash

# Jobscript example that can run several tasks in parallel.
# All features used here are standard in bash so it should work on
# any sane UNIX/LINUX system.
# Author: roy.dragseth@uit.no
#
# This example will only work within one compute node so let's run
# on one node using all the cpu-cores:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20

# We assume we will (in total) be done in 10 minutes:
#SBATCH --time=0-00:10:00

# Let us use all CPUs:
maxpartasks=$SLURM_TASKS_PER_NODE

# Let's assume we have a bunch of tasks we want to perform.
# Each task is done in the form of a shell script with a numerical argument:
# dowork.sh N
# Let's just create some fake arguments with a sequence of numbers
# from 1 to 100, edit this to your liking:
tasks=$(seq 100)

cd $SLURM_SUBMIT_DIR

for t in $tasks; do
  # Do the real work, edit this section to your liking.
  # remember to background the task or else we will
  # run serially
  ./dowork.sh $t &

  # You should leave the rest alone...

  # count the number of background tasks we have spawned
  # the jobs command print one line per task running so we only need
  # to count the number of lines.
  activetasks=$(jobs | wc -l)

  # if we have filled all the available cpu-cores with work we poll
  # every second to wait for tasks to exit.
  while [ $activetasks -ge $maxpartasks ]; do
    sleep 1
    activetasks=$(jobs | wc -l)
  done
done

# Ok, all tasks spawned. Now we need to wait for the last ones to
# be finished before we exit.
echo "Waiting for tasks to complete"
wait
echo "done"
  

And here is the dowork.sh script:

#!/usr/bin/env bash

# Fake some work, $1 is the task number.
# Change this to whatever you want to have done.

# sleep between 0 and 10 secs
let sleeptime=10*$RANDOM/32768

echo "Task $1 is sleeping for $sleeptime seconds"
sleep $sleeptime
echo "Task $1 has slept for $sleeptime seconds"
  

Source: HPC-UiT FAQ

Installing software

I need Python package X but the one on Star is too old or I cannot find it.

You can choose different Python versions throguh the module system or by using Anaconda (Miniforge).

In cases where this still doesn't solve your problem or you would like to install a package yourself, please read the next section below about installing without sudo rights.

If we don't have it installed, and installing it yourself is not a good solution for you, please contact us and we will do our best to help you.

How can I get sudo access?

Due to the cluster's architecture and security model, root or sudo access is restricted and standard users cannot perform operations that require root access. However, most standard tasks do not actually require root privledges or have non-root alternatives anyway. Please learn about using environment modules and virtual environments. If there is any other task that appears to require sudo access, please submit a support request or contact the HPC support team to assist you with your needs.

How can I install packages?

To install packages, you can create a virtual environments or install them into the home directory. Setting up a virtual environment allows you to isolate project dependencies, prevent version conflicts, and ensure your environment is reproducible if sharing or collaboration is necessary.

Python Example

Below are directions on using a simple virtual environment with venv. It is one of the many packages that allow you to manage virtual environments along with conda, virtualenv, and others.

Create a virtual environment:

python3 -m venv research1

Activate the environment:

source research1/bin/activate

Your current line should be prefixed with the environment name:

(research1) user@super-computer

Install new package:

pip install package_name

Deactivate the environment if it's active:

deactivate

R example

Below are directions on using a simple virtual environment with renv. It is one of the main packages used to manage virtual environments in R.

Create a virtual environment:

renv::init()

Activate the environment:

renv::activate()

Install new package:

install.packages("package_name")
  

Deactivate the environment if it's active:

renv::deactivate()

For more information on setting up development environments for other languages, including Julia, NodeJS, C, C++, and Rust, please see the virtual environments guide.

Compute and storage quota

How can I check my disk quota and disk usage?

To check the quota of the main project storage (parallel file system - /fs1/projects/<project>), you can use this command:

$ mmlsquota -j <project_name> fs1

The -j option specifies that you are querying a fileset, which is how quotas are set on different directories in GPFS.

How many CPU hours have I spent?

This command gives you a report of account utilization, including CPU hours, for the specified period.

$ sreport cluster AccountUtilizationByUser start=YYYY-MM-DD end=YYYY-MM-DD