Research Computing and Data

Summer 2024 Maintenance is Complete!

We are excited to announce that the Summer 2024 maintenance work was completed successfully. All RCD services have been restored and are once again available to users.

During the maintenance period, we made the following improvements:

  • Critical updates to network and storage infrastructure were completed.
    • These updates improve performance and stability for all users of the cluster.
  • GitLab was updated to a new major version, v17.2.2.
  • ColdFront was updated to v1.9.0.
    New features include:
    • The ability to request a GitLab group. A GitLab group organizes projects and repositories under a common namespace, simplifying permission management and collaboration on larger, team-based initiatives.
    • PIs can create a GitLab group by requesting a GitLab Group Allocation on ColdFront.
    • To learn more, please see the GitLab Group Allocation documentation page.
  • Slurm was upgraded to v24.05.2.
    • Spack MPI modules built with the old version of Slurm will need to be rebuilt.
  • Some MPI-enabled modules were moved in the module hierarchy.
    • The following were affected:
      • fftw/3.3.10
      • hdf5/1.14.3
      • netcdf-c/4.9.2
      • netcdf-cxx4/4.3.1
      • netcdf-fortran/4.6.1
      • netlib-scalapack/2.2.0
      • osu-micro-benchmarks/7.3
      • parallel-netcdf/1.12.3
    • To load any moved modules, you must first load the openmpi/5.0.1 module (see the example after this list).
  • Additionally, some AMD-optimized MPI modules were moved in the module hierarchy.
    • The following were affected:
      • amdfftw/4.1
      • amdscalapack/4.1
      • lammps/20231121
    • To load these packages, you must first load both the openmpi/5.0.1 and aocc/4.1.0 modules (also shown in the example after this list).
  • These changes were made to better reflect the dependencies of certain packages and to distinguish between MPI and non-MPI builds of the same version:
    • HDF5 was recompiled with support for C++, Fortran, and the High-Level (hl) APIs.
      • Two versions of HDF5 are available: one with MPI and one without.
    • hdf5/1.14.2 was upgraded to hdf5/1.14.3 (non-MPI).
  • CUDA-Aware Open MPI was rebuilt with better support for multi-node GPU detection when using Kokkos.
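For example, loading the relocated modules looks roughly like the sketch below. This is a minimal illustration, not an exhaustive list: the module names come from the lists above, and the load order shown for the AMD-optimized packages is one possibility; if you run into problems, check the prerequisites for your specific module.

    # Relocated MPI-enabled modules: load the MPI provider first.
    module load openmpi/5.0.1
    module load hdf5/1.14.3 netcdf-c/4.9.2

    # AMD-optimized MPI modules: both openmpi/5.0.1 and aocc/4.1.0 must be loaded first.
    module purge
    module load openmpi/5.0.1 aocc/4.1.0
    module load amdfftw/4.1 amdscalapack/4.1

    # Optional: confirm that the rebuilt Open MPI reports CUDA support
    # (standard Open MPI check; requires the openmpi module to be loaded).
    ompi_info --parsable --all | grep mpi_built_with_cuda_support:value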

We appreciate your patience during the maintenance period and hope that these changes will improve the user experience.

If you have any questions or have encountered post-maintenance issues, please let us know by submitting a support ticket.

Upcoming Summer 2024 Maintenance

The RCD team has scheduled a brief maintenance window at the end of the Summer 2024 semester.

The maintenance work will begin on Monday, August 12th, 2024 at 9:00 am. All RCD services, including the Palmetto Cluster and the Indigo Data Lake, will be unavailable until the maintenance work is complete.

During the maintenance period, we plan to complete the following:

  • Our system administrators will perform critical updates to the network and storage systems for Palmetto 2.
    • These updates will improve performance and stability for all users on the cluster.
    • These changes are not directly user-facing and should not break any existing user workflows.
  • Slurm will be updated to a new version.
    • Packages built with MPI will be rebuilt to link against the updated Slurm libraries.

This page will be updated with more details about the work completed after maintenance is over.

Users should expect that RCD services will be restored no earlier than Wednesday, August 14th, 2024 at 9:00 am and should monitor email for updates from RCD.

Please feel free to reach out to us with any questions or concerns that you have about this notice by submitting a support ticket – our team would love to hear from you!

New AI/ML Nodes Available

RCD is excited to announce that we have added five powerful new compute nodes to the Palmetto 2 cluster. The hardware in these nodes is optimized for large Artificial Intelligence (AI) and Machine Learning (ML) workflows.

Each node has the following hardware specifications:

  • Model: Dell PowerEdge XE9680
  • CPU: 2x Intel Xeon Platinum 8470
  • Memory: 1 TB
  • Networking:
    • 8x NDR InfiniBand (3.2 Tbit aggregate) for internode communication
    • 200 Gbit NDR200 InfiniBand for storage
    • 100 Gbit Ethernet
  • GPU: 8x NVIDIA H100 80 GB

These nodes are available in the cluster for immediate use by Clemson users. We encourage you to make the most of these new resources!
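For reference, a Slurm batch script requesting some of these GPUs might look roughly like the sketch below. The resource values and module name are illustrative, and we have omitted partition and account directives because those depend on your allocation; consult the Palmetto 2 documentation for the exact options to use.

    #!/bin/bash
    #SBATCH --job-name=h100-training
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=16
    #SBATCH --mem=128G
    #SBATCH --gpus-per-node=2       # request 2 of the 8 H100 GPUs on a node
    #SBATCH --time=12:00:00

    # Load your software stack (module name is illustrative), then launch the job.
    module load cuda
    srun python train.py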

Introducing the new Palmetto 2 cluster!

The RCD team is excited to announce that the new Palmetto 2 cluster is now online and ready for use. This marks the first major step in our transition from PBS to Slurm.

The logo for the new Palmetto 2 cluster.

For those who need assistance with the transition, we have prepared a Slurm Migration Guide that explains the most important differences between the two clusters. We are also offering Palmetto 2 Onboarding sessions, which provide a live walkthrough of how to use the new cluster.
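As a quick taste of what changes, common PBS commands map roughly to Slurm equivalents as sketched below. This is a simplified illustration only; batch script directives (#PBS vs. #SBATCH) change as well, and the Slurm Migration Guide is the authoritative reference.

    # PBS (Palmetto 1)       ->  Slurm (Palmetto 2)
    qsub job.pbs             ->  sbatch job.sh        # submit a batch job
    qstat -u $USER           ->  squeue -u $USER      # list your queued and running jobs
    qdel <jobid>             ->  scancel <jobid>      # cancel a job
    qsub -I                  ->  salloc               # request an interactive allocation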

Please note that access to Palmetto 2 is controlled by ColdFront, our new allocation management system. Faculty members can create projects on ColdFront and grant other users on their projects access to Palmetto 2. Students will need to ask their faculty advisor or course instructor for assistance with this process. To get started, please see the Palmetto 2 Accounts page on our documentation.

We have also launched a new Open OnDemand instance for Palmetto 2, located at ondemand.rcd.clemson.edu.

If you have any questions or need further assistance, please do not hesitate to reach out to RCD by submitting a support ticket.

Upcoming Spring 2024 Maintenance Work

The RCD team has scheduled a maintenance window to complete major changes to the Palmetto Cluster and other systems at the end of the Spring semester.

This work will begin on Monday, May 6th, 2024, at 9:00 am. While maintenance work is in progress, all RCD services, including the Palmetto Cluster and the Indigo Data Lake, will be unavailable.

During this maintenance window, the RCD team will complete the following updates, which may have user impact:

  1. The Palmetto 2 (Slurm) cluster will move into general availability.
  2. Additional nodes will move into Palmetto 2:
    • All nodes in owner queues
    • All nodes from HDR phases
  3. Our new allocation management system, ColdFront, will become available.
    • Current Palmetto 1 (PBS) accounts do not grant access to Palmetto 2.
    • Current Palmetto 1 users must request new allocations through ColdFront in order to use Palmetto 2.
    • No new accounts will be added to Palmetto 1 (PBS).
  4. /scratch1 and /fastscratch will be decommissioned and will no longer be available.
    • All data on /scratch1 and /fastscratch will be erased.
  5. /scratch will be re-initialized.
    • All data on /scratch will be erased, so copy anything you need to keep before maintenance begins (see the backup sketch after this list).
  6. ZFS systems will be decommissioned.
    • ZFS storage owners have been contacted about transitioning to the new storage system.
    • All data stored on ZFS file systems will be migrated to the Indigo Data Lake, so no data will be lost.
    • If you are a ZFS storage owner and have not received an email from us, please reach out to us.
  7. A new software module system will be introduced for Palmetto 2 (Slurm). This system will provide a more user-friendly and efficient way to manage software installations and versions.
  8. A refreshed Open OnDemand interface will be available for Slurm, providing a more modern and user-friendly experience for Slurm users.
  9. A new job monitoring and visualization tool, jobstats, will be deployed across the cluster. This new tool allows users to monitor their jobs more easily and efficiently and will replace many existing monitoring methods.
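Because the data erasure described in items 4 and 5 cannot be undone, please copy anything you need to keep off of the scratch file systems before maintenance begins. A minimal sketch is below; the source path assumes a typical per-user scratch directory, so adjust both paths to match where your data actually lives.

    # Copy important results off scratch before maintenance begins (paths are illustrative).
    rsync -av /scratch/$USER/important_results/ $HOME/important_results_backup/

    # Spot-check that the copy is complete before relying on it.
    diff -rq /scratch/$USER/important_results/ $HOME/important_results_backup/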

Users should expect that services will be restored no earlier than Monday, May 13th, 2024, at 5:00 pm and should monitor their email for updates from RCD.

We understand that these changes are significant and want to help users transition smoothly. RCD will make updated documentation, training/tutorial sessions, and additional support resources available after maintenance.

Please feel free to reach out to RCD with any questions or concerns that you have about the maintenance work by submitting a support ticket – we would love to hear from you!

RCD Town Hall on March 15th, 2024

The Research Computing and Data (RCD) team plans to hold a town hall event on Friday, March 15th, 2024 from 3-4 p.m. via Zoom. Join us to hear updates about:

  • Palmetto Maintenance (tentatively May 6-May 10)
  • Introducing new RCD Staff
  • All new nodes and owner nodes moving to Slurm in May
  • ZFS file systems will be retired this summer
  • Reminder about the upcoming decommission of /scratch1 and /fastscratch this summer
  • Introducing the Globus enterprise license for moving large data sets easily
  • … and more!

Any current student or employee is welcome to participate.  Please register and you will receive a confirmation email containing information about joining the meeting through Zoom.

Update: Did you miss the Town Hall? You can watch the recording or view the slides online.
Note: Users must log in through Clemson SSO to view these resources, which will be available until April 14th, 2024.

New CPU Nodes

We are excited to announce that RCD has finalized an order for 65 new CPU nodes that will be installed in the Slurm Beta cluster.

Each new node has the following specifications:

  • CPU: 2x AMD EPYC 9654 – 96 cores per socket, 192 cores total
  • Memory: 768 GB
  • Network: 200 Gbit NDR200 InfiniBand, 25 Gbit Ethernet

These nodes will replace many of our aging C1 nodes, which are the oldest in the cluster. At the time of writing, the C1 portion of the cluster has a total of 9,680 CPU cores, and 7,044 of those cores will be removed over the next several months. Nodes from phases 3, 4, and 6 will be removed this winter, and nodes from phases 1a, 1b, 5c, and 5d will be removed later. Phases 2a and 2c will remain available.

Our new purchase will provide 12,480 cores, a net gain of 5,436 CPU cores over the 7,044 being removed, along with an increase in total memory in the cluster. In addition, the new nodes will feature the latest generation of CPU architecture, and we expect a vast improvement in single-core performance compared to the removed nodes. Once the nodes are installed, our team will provide benchmarks.

RCD will make up to 35 nodes available for sale to interested faculty members who would like priority access. Please see our purchasing guide for pricing and details.

We expect the new nodes to be installed and available in the Slurm Beta cluster early in the Spring 2024 semester.