For a list of the resources available at Star HPC, take a look at About star.
Strains on the cluster occur when resources are over-requested or misallocated, leading to potential bottlenecks, decreased system performance, and extended wait times for job execution.
Common issues arise from:
A user requests far more resources than the job needs. For instance, a user submits a data analysis job that needs only a minimal set of resources but requests all 8 A100 GPUs on gpu1, expecting this to speed up the process. Counterintuitively, the job takes longer to start, because the job scheduler must wait until all the requested resources are available before it can run the job. Furthermore, if the job actively uses only one GPU, the remaining seven sit idle yet remain unavailable to other users. This is a significant overallocation of resources, leading to inefficiencies and potential delays for other tasks that need GPU support.
A user requests more resources than are physically available on the cluster. For instance, a user submits a job that requests 9 A100 GPUs on gpu1, where the maximum is 8. This would cause the job to fail immediately and produce an error.
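The two cases above can be sketched as a simple pre-submission sanity check. This is only an illustration in Python: the function names and the `needed` parameter are hypothetical, the capacity of 8 GPUs is taken from the gpu1 example above, and none of this is part of the cluster's actual scheduler or tooling.

```python
# Hypothetical helper for illustration only; not a scheduler API.
GPU1_CAPACITY = 8  # A100 GPUs on gpu1, as stated in the examples above


def classify_gpu_request(requested: int, needed: int,
                         capacity: int = GPU1_CAPACITY) -> str:
    """Classify a GPU request as 'invalid', 'overallocated', or 'ok'.

    requested: number of GPUs the job asks for
    needed:    number of GPUs the job will actively use (an estimate)
    capacity:  number of GPUs physically present on the node
    """
    if requested < 1 or needed < 1:
        raise ValueError("GPU counts must be positive")
    if requested > capacity:
        # More GPUs than the node physically has: the job cannot run.
        return "invalid"
    if requested > needed:
        # The extra GPUs would sit idle and stay unavailable to others.
        return "overallocated"
    # Under-requesting is outside the scope of this sketch.
    return "ok"


def idle_gpus(requested: int, needed: int) -> int:
    """Number of GPUs that would be reserved but left unused."""
    return max(requested - needed, 0)
```

With the numbers from the examples, `classify_gpu_request(8, 1)` returns `"overallocated"` with `idle_gpus(8, 1)` equal to 7, while `classify_gpu_request(9, 1)` returns `"invalid"`.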