HPC-UGent system status

The status of all systems can now also be consulted at http://status.vscentrum.be/

Status

Tier-2 UGent
Login nodes: Maintenance extended
Tier-2 webportal: Maintenance extended
Compute nodes: Maintenance extended
Shared filesystems: Maintenance extended
Maintenance was scheduled for 28/11 - 2/12/2022 but has been extended (see below for more info)
Tier-1 Hortense (VSC)
Login nodes: Available
Tier-1 webportal: Available
Compute nodes: Available
Scratch filesystem: Available
(Access to Tier-1 compute requires an approved project - see https://www.vscentrum.be/compute)
Tier-1 Cloud (VSC): Available
(Access to Tier-1 cloud requires an approved project - see https://www.vscentrum.be/cloud)
VSC account page: Available

Known issues

[Mon 28 Nov 2022] Tier-1 webportal and compute nodes were unavailable for users with UGent affiliation

  • The Tier-1 webportal and compute nodes were unavailable for all users with a vsc4xxxx style vsc-id.
  • Both were affected by the unavailability of NFS links to the UGent Tier-2 shared storage, which was down for maintenance.
  • [Mon 28 Nov 2022 17h00] Problems have been fixed; Tier-1 is fully functional again for all vsc4xxxx users.

Planned maintenance

Extended scheduled maintenance work on HPC-UGent Tier-2 systems (started 28 November 2022)

Unavailable:

  • all Tier-2 login nodes
  • all Tier-2 clusters
  • the shared storage

We will mainly perform updates in this maintenance window, including firmware, shared filesystems, OS, Slurm and the central software stack provided through modules.

Downtime extended

We are facing unexpected filesystem/network stability issues that currently make the clusters unreliable.
The Tier-2 login nodes and all Tier-2 clusters will remain unavailable until the problem is resolved.

  • What is going on?
    During routine stress tests, we are seeing filesystem issues that can cause an entire cluster to go offline.
    This is likely the result of a bug in the parallel filesystem software that surfaces only in a specific combination with the kernel and network software stack versions.
    Together with our vendor IBM, we are debugging this issue and hope to fix it as soon as possible.
    At the same time, we are researching and preparing other scenarios to bring the clusters back up again.
  • Is the data safe on the HOME/DATA/SCRATCH filesystems?
    The integrity of the storage is not at risk.
    Only the clusters are unusable at the moment because of this bug.
  • Why wasn’t this tested beforehand?
    It was.
    We have migrated to a fully supported software stack and tested it thoroughly well before the maintenance.
    However, this bug seems to be triggered by a unique I/O pattern that only occurs on a live production system.

Updates

  • [Fri 2/12] Extension of downtime announced
  • [Mon 5/12] Over the weekend, several tests were run that may provide information to pinpoint the bug.
    We are analyzing the results further and are in communication with vendor IBM.
  • [Tue 6/12] There has been progress in collecting info to debug the issue.
    In a way, it is reassuring that other sites are now also reporting similar issues.
    We are currently awaiting a fix from vendor IBM.
    However, we do not expect the clusters to be available again before the end of this week (9 Dec).
    In the meantime, we are exploring under which conditions jobs could still run without triggering the bug.

Reminders

  • [May 2022] All HPC-UGent login nodes and Tier-2 clusters have been migrated to the Red Hat Enterprise Linux 8 operating system.
  • [Wed 9 June 2021] New job command wrappers
    We switched to new job command wrappers for all HPC-UGent Tier-2 clusters.
    This switch should be transparent: you don't need to change your workflow or job scripts.
  • [Wed 27 May 2020] All SSH public keys uploaded before 20 May 2020 have been revoked.
    More information regarding this security operation at https://docs.vscentrum.be/en/latest/security_measures_20200520.html

Contact

For issues regarding Tier-2 UGent systems, contact hpc@ugent.be

For issues regarding Tier-1 Compute (VSC), contact compute@vscentrum.be

For issues regarding Tier-1 Cloud (VSC), contact cloud@vscentrum.be

Cluster load of Tier-2 UGent systems

Consult http://hpc.ugent.be/clusterstate/

(only available within the UGent network)