(Previously known as "Odyssey Related")
Real-time cluster status:
(embedded live status from https://gist-it.appspot.com/github/PackardChan/harvard-cluster-monitor/blob/master/output/motd?slice=18:-1)
This page describes the Harvard FAS Research Computing cluster.
Quick start guide: https://docs.rc.fas.harvard.edu/kb/quickstart-guide/
There is an "Introduction to the Cluster" course that new users are required to attend within 45 days of account issue. The online training and quiz can be accessed here:
https://docs.rc.fas.harvard.edu/kb/introduction-to-cluster-online/
Please check out the companion GitHub page on customizing your cluster account; it hosts scripts and is written to be general enough for any computer cluster.
The following are some notes supplementing the above links.
https://docs.rc.fas.harvard.edu/kb/cluster-storage/
Cost: https://www.rc.fas.harvard.edu/services/data-storage/#Offerings_Tiers_of_Service
As an example, below are the available disks for Kuang's group (updated every Monday). Kuang's group members can run ~pchan/git/harvard-cluster-monitor/script/df.sh
to see the latest usage (it should take only several seconds).
(embedded disk usage table from https://gist-it.appspot.com/github/PackardChan/harvard-cluster-monitor/blob/master/output/df-txt)
df shows hardware limits; lfs quota shows software quotas.
lfs quota -hu $USER some_disk is a quick way to see your usage.
du -h some_folder shows the usage of that folder (can take hours for a large number of files); a combined example follows these notes.
Home directory: 100G per person.
/n/home??/.snapshot/rc_homes_*/${USER}/ stores regular snapshots of your home directory.
/scratch: visible only on each compute node, not shared with other nodes.
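Putting the commands above together, a minimal sketch of checking quota and usage (the disk path and folder name are only examples; substitute your own):
lfs quota -hg `id -gn` /n/holylfs     # group quota and usage on a Lustre disk (example path)
lfs quota -hu $USER /n/holylfs        # your own usage on the same disk
du -sh ~/my_project                   # total size of one folder (placeholder name; can take a long time)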
The following nodes might have difficulties accessing the following disks (updated around noon).
(embedded list from https://gist-it.appspot.com/github/PackardChan/harvard-cluster-monitor/blob/master/output/disk-bad-node)
https://docs.rc.fas.harvard.edu/kb/unix-permissions/
https://www2.cisl.ucar.edu/user-support/setting-file-and-directory-permissions
[pchan@boslogin04 ~]$ ls -ld /n/home05/pchan
drwxr-xr-x 78 pchan kuang_lab 3860 Nov 1 22:42 /n/home05/pchan
In the string "drwxr-xr-x", the 1st char (d) says it is a directory.
2nd-4th chars (rwx) describe the permissions for user (pchan). All permissions read (r), write (w) and execute/search (x) are given here.
5th-7th chars (r-x) describe the permissions for group (kuang_lab). Read (r) and execute/search (x) permissions are given here, but not write (w) permission.
8th-10th chars (r-x) describe the permissions for others.
For directories, you often need both r and x permissions to read their contents.
id pchan # list the groups that user pchan belongs to
getent group kuang_lab # list the members of group kuang_lab
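As a hedged example (the directory path is hypothetical), to let the whole group read but not modify a project directory:
chgrp -R kuang_lab /n/home05/pchan/shared_project   # hand the directory tree to the group
chmod -R g+rX /n/home05/pchan/shared_project        # group gets r on files and x (search) on directories
ls -ld /n/home05/pchan/shared_project               # verify the new mode and group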
Access can also be limited by a parent directory's permissions (you need x on every directory along the path).
Beware of group ownership, e.g. in the transfer space:
[pchan@datamover01 ~]$ ls -ld /n/holylfs/INTERNAL_REPOS/CLIMATE_MODELS/
drwxrws--- 9 root huce 4096 Oct 9 23:45 /n/holylfs/INTERNAL_REPOS/CLIMATE_MODELS/
sticky bit
ACL (access control list)
https://www2.cisl.ucar.edu/resources/storage-and-file-systems/glade/using-access-control-lists
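A minimal ACL sketch (the user name and path are hypothetical, and not every disk may support ACLs):
setfacl -R -m u:someuser:rX /n/holylfs/LABS/kuang_lab/pchan/shared_data   # grant one extra user read + directory search
getfacl /n/holylfs/LABS/kuang_lab/pchan/shared_data                       # inspect the resulting ACL
setfacl -R -x u:someuser /n/holylfs/LABS/kuang_lab/pchan/shared_data      # revoke the entry later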
List files that could be cleaned in scratchlfs:
lfs find /n/scratchlfs/`id -gn`/${USER}/ -mtime +88 -type f -print
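If you have checked the list and the files can really go, a possible follow-up (the simple xargs assumes no spaces or newlines in file names):
lfs find /n/scratchlfs/`id -gn`/${USER}/ -mtime +88 -type f -print | xargs -r rm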
Transfer
RC is moving to Globus for large-scale transfers. datamover01 will be less supported. (9/29/2020)
Globus transfer (https://docs.rc.fas.harvard.edu/kb/globus-file-transfer/)
Use the endpoint Harvard FAS RC Holyoke (or Harvard FAS RC Boston for Boston disks), click the Continue button, and log in with your RC username and 6-digit code.
/n/holylfs/TRANSFER/$USER, /n/holylfs02/TRANSFER/$USER and /n/boslfs/TRANSFER/$USER are created. Files in these transfer spaces do count towards the group quota for the same disk.
Using a compute node (some info outdated):
Transfer data within the cluster (https://docs.rc.fas.harvard.edu/kb/transferring-data-on-the-cluster/)
(rsync can be parallelized, ask RC how. (?https://github.com/fasrc/slurm_migration_scripts))
sbatch -p huce_intel -c 8 -t 1440 --mem-per-cpu=1000 --open-mode=append --wrap='fpsync -n $SLURM_CPUS_PER_TASK -o "-ax" -O "-b" "/n/kuanglfs/pchan/jetshift/" "/n/holylfs/LABS/kuang_lab/pchan/jetshift/"' # 13T, 4h54m
Using datamover01 (some info outdated):
To copy a lot of files: (https://docs.rc.fas.harvard.edu/kb/globus-file-transfer/)
ssh datamover01
rsync -avu source_dir dest_dir # DO NOT add -z !!
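A cautious habit (my suggestion, not from RC's docs): do a dry run first to see what would change, then run the real copy:
rsync -avun source_dir dest_dir   # -n: dry run, only lists what would be transferred
rsync -avu  source_dir dest_dir   # the real copy (still no -z)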
The old lab disks are now under:
/n/holylfs04/LABS/kuang_lab/Lab/$USER
/n/holylfs04/LABS/kuang_lab/Lab/kuanglfs/$USER
/n/holylfs04/LABS/kuang_lab/Lab/kuangfs1/$USER
/n/holylfs04/LABS/kuang_lab/Lab/kuang100/01/$USER, etc.
https://docs.rc.fas.harvard.edu/kb/running-jobs/#Slurm_partitions
https://docs.rc.fas.harvard.edu/kb/huce-partitions/
Below are the job queues available to Packard (excluding *_requeue, updated every Monday). Run ~pchan/git/harvard-cluster-monitor/script/sinfo.sh
to see the job queues available to you (it should take only several seconds).
(embedded partition table from https://gist-it.appspot.com/github/PackardChan/harvard-cluster-monitor/blob/master/output/sinfo-exclude-requeue)
Melissa Sulprizio created a Google group for HUCE partition users. (4/28/2020)
Convenient Slurm commands: https://docs.rc.fas.harvard.edu/kb/convenient-slurm-commands/
If you are familiar with PBS or other job schedulers, here is a good comparison: https://slurm.schedmd.com/rosetta.pdf
https://docs.rc.fas.harvard.edu/kb/quickstart-guide/#Run_a_batch_job
RC recommends srun -p test --pty --x11=first --mem 500 -t 0-08:00 /bin/bash
to start an interactive session. However, this interactive session will become unresponsive after one hour of inactivity. Workaround: ssh to the allocated node (instead of working in the shell started by srun); the ssh session has no timeout. Unlike srun, most environment variables are not transmitted through ssh, so you will need to reload your modules, cd to your working directory and set any other environment variables you need.
Personally, I use salloc -p huce_intel -n 1 -t 0-12:00 --mem=30000
to allocate resources. salloc opens a new shell on the local machine with the SLURM_* variables set. Then I run ssh -Y $SLURM_JOB_NODELIST
to ssh to the allocated node. Once you finish, exit twice: once to log out of the compute node and once to leave the salloc shell.
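Put together, that workflow looks roughly like this (partition, time and memory are just the values above; adjust to your needs):
salloc -p huce_intel -n 1 -t 0-12:00 --mem=30000   # opens a new local shell with SLURM_* variables set
ssh -Y $SLURM_JOB_NODELIST                         # jump onto the allocated compute node
# ... work interactively on the node ...
exit                                               # log out of the compute node
exit                                               # leave the salloc shell and release the allocation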
xbash: an alias/function defined in my bashrc that submits an interactive job. You can modify it to submit to a different partition.
sacct
squeue
You can also use sacct, squeue, scontrol show, scontrol update, sinfo and other Slurm commands. But do note that squeue has been modified by RC to reduce load on the scheduler.
nodeinfo: function defined in ~pchan/.bashrc that gives current 'partition-level' status of the huce partitions.
~pchan/bin/lsload.pl gives current 'node-level' status of non-full nodes in huce partitions.
Usage: lsload.pl; OR lsload.pl huce_intel
RC's advice on submitting large numbers of jobs
If you are submitting a lot of small jobs that will take up the whole huce_intel partition overnight, I propose excluding several nodes as a fast lane. This may mean a few extra minutes for you on top of an overnight run, but one less night of waiting for those who are running small jobs.
Opening up a fast lane:
squeue -u $USER -t PD --noheader -o "%13i %.9P %.24j %.8u %.2t %.10M %.4C %.3D %R" |awk '{print $1}' |xargs -I {} scontrol update jobid={} ExcNodeList=`cat /n/home05/pchan/sw/crontab/node-fastlane`
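Alternatively, assuming the node-fastlane file holds a node list that sbatch accepts, you can exclude the fast-lane nodes already at submission time (my_job.sh is a placeholder):
sbatch -p huce_intel --exclude=`cat /n/home05/pchan/sw/crontab/node-fastlane` my_job.sh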
Cost: https://www.rc.fas.harvard.edu/services/cluster-computing/#Offerings_Tiers_of_Service
https://rc.fas.harvard.edu/wp-content/uploads/2016/03/Troubleshooting-Jobs-on-Odyssey.pdf
Within a Slurm environment (including remote desktop), where the SLURM* environment variables are set, srun behaves differently: it tries to create a job step inside the existing job. If you don't mean to create a job step, please at least unset SLURM_JOBID SLURM_JOB_ID.
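For example, inside a remote desktop session (a sketch; the srun options are only illustrative):
unset SLURM_JOBID SLURM_JOB_ID                     # so srun asks for a new allocation instead of a job step
srun -p test --pty --mem 500 -t 0-02:00 /bin/bash  # now starts a fresh interactive job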
Favorite modules in Kuang's group:
module load matlab #/R2017a-fasrc02
# load matlab first: avoids "undefined reference to `ncdimdef'"; only conflicts with netcdf/4.1.3
module load intel/17.0.4-fasrc01 impi/2017.2.174-fasrc01 netcdf/4.1.3-fasrc02
module load libpng/1.6.25-fasrc01 # for WRF grib2
module load jasper/1.900.1-fasrc02 # for WRF grib2
module load perl-modules/5.10.1-fasrc13 # for CESM
module load nco/4.7.4-fasrc01
module load ncview/2.1.7-fasrc01
module load ncl_ncarg/6.4.0-fasrc01
module load grads/2.0.a5-fasrc01
Best place to search for modules: https://portal.rc.fas.harvard.edu/apps/modules
module show ncview/2.1.2-fasrc01 # must load prerequisite before running this line
You can look at ~pchan/.bashrc.
NCL: see the 6 lines of "export" in my bashrc.
mod18: the modules I am currently loading. Called in bashrc during login.
matlab: my alias starts MATLAB with -nodesktop by default. To get the desktop, bypass the alias with \matlab -nosplash -singleCompThread
Python: https://docs.rc.fas.harvard.edu/kb/python/
spyder is only available in Anaconda3/5.0.1-fasrc02 & Anaconda/5.0.1-fasrc02.
ipython & jupyter are available in all 5.0.1-fasrc01 & 5.0.1-fasrc02.
impi (Intel mpi) is preferred, because it is faster than openmpi and mvapich2, etc., by some 50-100%.
You have to use mpiifort, mpiicc & mpiicpc to replace mpif90, mpicc & mpicxx, in Makefile, configure.wrf, etc.
The srun flag --mpi=pmi2
is recommended. (?https://slurm.schedmd.com/mpi_guide.html#intel_mpi)
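A minimal sketch of compiling and running an MPI program with Intel MPI under Slurm (hello_mpi.f90 is a placeholder source file; module versions follow the list above):
module load intel/17.0.4-fasrc01 impi/2017.2.174-fasrc01
mpiifort -O2 -o hello_mpi hello_mpi.f90                                    # mpiifort/mpiicc/mpiicpc, not mpif90/mpicc/mpicxx
srun -p huce_intel -n 16 -t 60 --mem-per-cpu=1000 --mpi=pmi2 ./hello_mpi   # pmi2 as recommended above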
Login node with the smallest 15-minute load is shown in the last row. This is updated hourly by ~pchan/git/harvard-cluster-monitor/script/loginnode-loadavg.sh
(embedded list from https://gist-it.appspot.com/github/PackardChan/harvard-cluster-monitor/blob/master/output/loginnode-loadavg-sorted)
You can specify a login node by using boslogin01.rc.fas.harvard.edu or boslogin.rc.fas.harvard.edu instead of login.rc.fas.harvard.edu.
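For example (replace the username with your own):
ssh your_username@boslogin01.rc.fas.harvard.edu   # a specific Boston login node
ssh your_username@login.rc.fas.harvard.edu        # the general address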
Leaving Harvard: Your account will at some point be disabled. https://docs.rc.fas.harvard.edu/kb/leaving-external/
FAQ:
source new-modules.sh can be removed from bashrc (recommended, though not necessary).
After module load centos6/0.0.1-fasrc01, module purge might not clean up everything. Logging in again is the best way to clean up everything.
Use complete to remove or change bash completions; also see ~pchan/.bashrc.
Read more in https://docs.rc.fas.harvard.edu/kb/centos-7-transition-faq/, and Plamen's slides kuang_lab-7-8-2018.pdf.