About Star
Overview
The Star cluster is a High-Performance Computing (HPC) system at the Science and Innovation Center (SIC) designed for a wide range of advanced research and computational tasks. It combines NVIDIA HGX-based compute nodes, a high-speed all-flash parallel file system storage system, an ultra-high-throughput, low-latency HDR 200 Gb/s InfiniBand network fabric, and a suite of software applications. The compute nodes feature high-end NVIDIA H100 and A100 GPUs, AMD EPYC and Intel Xeon processors, and over 7 terabytes of combined RAM.
The cluster runs SLURM (Simple Linux Utility for Resource Management), a job scheduler and queueing system that efficiently allocates the cluster's resources to manage competing resource demands.
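For example, a minimal SLURM batch script looks like the sketch below; the partition name and resource amounts are placeholders, so check the cluster's documentation or `sinfo` for actual values:

```bash
#!/bin/bash
#SBATCH --job-name=hello-star       # name shown in the queue
#SBATCH --partition=general         # placeholder partition; list real ones with `sinfo`
#SBATCH --nodes=1                   # run on a single node
#SBATCH --ntasks=1                  # one task (process)
#SBATCH --cpus-per-task=4           # CPU cores for that task
#SBATCH --mem=8G                    # memory for the job
#SBATCH --time=00:10:00             # walltime limit (HH:MM:SS)
#SBATCH --output=%x-%j.out          # log file named <job-name>-<job-id>.out

# Report where and when the job actually ran
hostname
date
```

Submit the script with `sbatch job.sh`, monitor it with `squeue -u $USER`, and cancel it with `scancel <jobid>`.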
Users run many different applications on the cluster depending on their needs, such as Python projects in Jupyter Notebooks, OpenMPI-based parallel jobs, and NetCDF workloads (a format often used to manage large datasets in climatology, meteorology, oceanography, and GIS applications). Programs run directly on the hardware (bare metal) to maximize performance and minimize overhead.
Containerization is also increasingly popular in HPC: it provides isolated environments that allow images to be reused for better reproducibility and software portability, without the performance impact of other methods or the hassle of manually installing dependencies. Containers are run with Apptainer (formerly Singularity), a containerization platform similar to Docker, with the major difference that it runs under user privileges instead of root. Users can deploy images from NGC (NVIDIA GPU Cloud), which provides access to a wide array of pre-built images with GPU-optimized software for diverse applications. Leveraging container images can save a lot of time, as users do not need to set up software applications from scratch and can simply pull and run the NGC images with Apptainer.
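As an illustration, a typical Apptainer workflow on a GPU node might look like the following; the PyTorch image tag is only an example, so browse the NGC catalog for a current one:

```bash
# Pull a GPU-optimized image from NGC and save it as a local SIF file
# (example tag; see https://catalog.ngc.nvidia.com for current versions)
apptainer pull pytorch.sif docker://nvcr.io/nvidia/pytorch:24.01-py3

# Run a command inside the container; --nv makes the host's NVIDIA GPUs visible
apptainer exec --nv pytorch.sif python -c "import torch; print(torch.cuda.is_available())"
```

In batch jobs, the same `apptainer exec --nv ...` line can simply be placed inside a SLURM script like the one shown above.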
Hardware
Login Nodes
- IBM System x3550 with 128GB RAM
- DL325 Gen10+ with an 8-core EPYC processor, 128GB RAM
- DL385 Gen10+ v2 with 2x AMD 32-core EPYC processors, 256GB RAM
Compute Nodes
HPE Apollo 6500 Gen10
Attribute\Node Name | gpu1 and gpu2 |
---|---|
Model Name | HPE ProLiant XL675d Gen10 Plus; Apollo 6500 Gen10 Plus Chassis |
Processors | AMD EPYC 7513 |
Sockets | 2 |
Cores per Socket | 32 |
Threads per Core | 2 |
Memory | 1024 GiB Total Memory (16 x 64GiB DIMM DDR4) |
GPU | 8 SXM NVIDIA A100s |
Local Storage (Scratch Space) | 6.4TB (5.8TiB) SSD |
HPE DL385 Gen10
Attribute\Node Name | cn01 |
---|---|
Model Name | HPE ProLiant DL385 Gen10 Plus v2 |
Processors | AMD EPYC 7513 32-Core Processor |
Sockets | 2 |
Cores per Socket | 32 |
Threads per Core | 2 |
Memory | 256GiB Total Memory (16 x 16GiB DIMM DDR4) |
GPU | 2 SXM NVIDIA A30s |
Local Storage (Scratch Space) | None |
HPE DL380a Gen11
Attribute\Node Name | gpu3 and gpu4 |
---|---|
Model Name | HPE DL380a Gen11 |
Processors | 64 Physical cores / 128 Logical Cores (2 x Intel Xeon-P 8462Y+ @ 2.8GHz) |
Memory | 512 GiB DDR5 RAM |
GPU | 2 NVIDIA H100 80GB GPUs (NVAIE subscription) |
Network | 4-port GbE, 1-port HDR200 InfiniBand |
Local Storage (Scratch Space) | None |
Cray XD665 Nodes
Attribute\Node Name | gpu5 and gpu6 |
---|---|
Model Name | Cray XD665 |
Processors | 64 Physical cores / 128 Logical Cores (2 x AMD EPYC Genoa 9334 @ 2.7GHz) |
Memory | 768 GiB DDR5 RAM |
GPU | 4 NVIDIA HGX H100 80GB SXM GPUs |
Network | 2-port 10GbE, 1-port HDR200 InfiniBand |
Local Storage (Scratch Space) | None |
Cray XD670 Node
Attribute\Node Name | gpu7 |
---|---|
Model Name | Cray XD670 |
Processors | 64 Physical cores / 128 Logical Cores (2 x Intel Xeon-P 8462Y+ @ 2.8GHz) |
Memory | 2048 GiB DDR5 RAM |
GPU | 8 NVIDIA HGX H100 80GB SXM GPUs |
Network | 2-port 10GbE, 1-port HDR200 InfiniBand |
Local Storage (Scratch Space) | None |
Storage System
Our storage system consists of four HPE PFSS (Parallel File System Storage) nodes, collectively offering a total of 63 TB of storage. Because it is a parallel file system, you can think of these four nodes as one unified 63 TB storage unit. The nodes work in parallel and are all mounted under a single mount point, /fs1, on the GPU nodes only.
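As a quick check from a GPU node, you can verify that /fs1 is mounted and see how much space is available:

```bash
# Show capacity and usage of the parallel file system
df -h /fs1

# Show the file system type and mount options
findmnt /fs1
```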
Our vision
Making complex and time-intensive calculations simple and accessible.
Our Goal
Our heart is set on creating a community where our cluster is a symbol of collaboration and discovery. We wish to provide a supportive space where researchers and students can pursue their scientific ideas and explore uncharted areas. We aim to make the complicated world of computational research a shared path of growth, learning, and significant discoveries for those eager to learn.
Operations Team
- Alexander Rosenberg
- Mani Tofigh
The Board
- Edward H. Currie
- Daniel P. Miller
- Adam C. Durst
- Jason D. Williams
- Thomas G. Re
- Oren Segal
- John Ortega