Check Utilization
Middle Earth is a shared resource with no automated means of allocation such as Slurm or HTCondor. You can check current usage directly from a
JupyterHub console or over SSH.
Storage
You can see a quick readout of shared filesystem utilization with df -h | grep -vE 'user|tmp|efi'. When run on arnor or
gondor you will see something like this:
$ df -h | grep -vE 'user|tmp|efi'
Filesystem        Size  Used Avail Use% Mounted on
/dev/nvme0n1p2     64G   13G   52G  20% /
/dev/nvme0n1p4    798G  7.1G  791G   1% /local
epyc:/data/epyc   146T  145T  823G 100% /astro/store/epyc
epyc:/nvme         11T  8.0T  2.6T  77% /astro/store/epycn
epyc:/data3/epyc  393T  385T  8.2T  98% /astro/store/epyc3
epyc:/data4/epyc  241T  240T  339G 100% /astro/store/epyc4
epyc:/data2/epyc  175T  155T   20T  89% /astro/store/epyc2
Note that the first entry, which is mounted on /, is the boot drive. Filling up the
boot drive will cause the compute node to become unstable for other users.
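As a quick way to spot filesystems that are close to full, the readout above can be filtered by the Use% column. This is a minimal sketch: the function name full_filesystems and the 90% threshold are illustrative choices, not part of the cluster's tooling, and mount points containing spaces are not handled.

```shell
# Print "mountpoint use%" for every filesystem at or above a threshold.
# Reads df-style output on stdin; the threshold (default 90) is the first argument.
# full_filesystems is a hypothetical helper name used for illustration.
full_filesystems() {
    awk -v t="${1:-90}" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)          # strip the trailing percent sign
        if (pct + 0 >= t + 0) print $6, $5
    }'
}

# -P keeps each filesystem on a single line so the columns stay predictable.
df -P | full_filesystems 90
```

If / shows up in this list, the boot drive is the first thing to clean up.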
If you are trying to figure out which filesystem a file is located on, run df -h <filename>, which returns a single-line
table showing the relevant filesystem and that filesystem’s availability statistics. For example:
$ df -h filename.txt
Filesystem      Size  Used Avail Use% Mounted on
epyc:/nvme       11T  8.0T  2.6T  77% /astro/store/epycn
If you want to know how big a set of files is, you can use the du utility to add up the size of files on disk. Beware that this
can take a long time if you are counting the space usage of many small files. For example:
$ du -sh my-research-directory
234G my-research-directory
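When a directory turns out to be large, it is often useful to see which of its immediate children account for the space. The sketch below assumes GNU du and sort, as found on typical Linux systems; largest_children is an illustrative name, not an existing command.

```shell
# List the ten largest entries one level below a directory, biggest first.
# -x keeps du on a single filesystem; sort -h understands sizes like 2.1M or 234G.
# largest_children is a hypothetical helper name used for illustration.
largest_children() {
    du -xh --max-depth=1 -- "${1:-.}" 2>/dev/null | sort -rh | head -n 10
}

largest_children .
```

The output includes a line for the directory itself, showing the total for everything beneath it.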
Compute
You can use top to check CPU usage. It displays the usernames and Unix process IDs of the processes using the most memory
and CPU. The load average it reports is a measure of how many processes are running or waiting for a CPU, and is the
best indicator of whether you will be able to get CPU time for your process. You can learn more about load averages
from this blog post.
If a compute node runs out of available CPU, processes on it will generally run slowly.
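A common rule of thumb is to compare the load average with the number of CPU cores: a load at or above the core count suggests new processes will have to wait for CPU time. The following is a minimal sketch of that comparison, assuming a Linux node with /proc/loadavg and the coreutils nproc command; load_status and its "busy"/"ok" labels are illustrative, not an established convention on this cluster.

```shell
# Report "busy" when the load average meets or exceeds the core count, else "ok".
# $1 = load average, $2 = number of cores.
# load_status is a hypothetical helper name used for illustration.
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load + 0 >= cores + 0) print "busy"; else print "ok"
    }'
}

# The first field of /proc/loadavg is the 1-minute load average.
read -r load1 rest < /proc/loadavg
cores=$(nproc)
echo "load=$load1 cores=$cores status=$(load_status "$load1" "$cores")"
```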
Memory
You can use free -h to see a readout of the free memory on Linux. top can give you some hints about which processes are
using memory (look at the RES column); however, free -h is better for system-wide statistics. For more information on Linux memory
usage, consult this blog post.
If a compute node runs out of memory, it will usually run slowly for a time, and then Linux will kill processes to free up memory.
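For a compact, non-interactive view of both the system-wide and per-process picture, something like the following can help. It is a sketch that assumes a Linux node with /proc/meminfo and the procps version of ps; available_gib is an illustrative helper name.

```shell
# Print available memory in GiB from /proc/meminfo-style input on stdin.
# MemAvailable is reported in KiB, so divide by 1024 * 1024.
# available_gib is a hypothetical helper name used for illustration.
available_gib() {
    awk '/^MemAvailable:/ { printf "%.1f\n", $2 / 1048576 }'
}

echo "available GiB: $(available_gib < /proc/meminfo)"

# Header plus the five processes with the largest resident memory (RSS, in KiB).
ps -eo user,pid,rss,comm --sort=-rss | head -n 6
```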
GPUs
On GPU nodes you can use nvidia-smi -l 1 to show a readout, refreshed every second, of the compute and memory utilization across all GPUs on
the system. Processes and owning users are also shown in this view. For more information on how to query GPU data at the command
line, consult this NVIDIA support article.
If no GPU resources are available, programs that use the GPU will experience errors. Many GPU-enabled programs allow you to choose which device they run on, so if you are experiencing errors running a GPU-enabled program, you might try running it on a different GPU.
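One way to choose a different GPU is to find the device with the least memory in use and expose only that device to your program through the CUDA_VISIBLE_DEVICES environment variable. The sketch below assumes a CUDA-based program; least_used_gpu is an illustrative helper and my_gpu_program is a placeholder for your own command. Programs that take their own device flag may not need the environment variable at all.

```shell
# Pick the GPU index with the least memory in use.
# Expects "index, memory_used_MiB" lines on stdin, e.g. "0, 10240".
# least_used_gpu is a hypothetical helper name used for illustration.
least_used_gpu() {
    sort -t, -k2,2n | head -n 1 | cut -d, -f1
}

if command -v nvidia-smi >/dev/null 2>&1; then
    gpu=$(nvidia-smi --query-gpu=index,memory.used \
                     --format=csv,noheader,nounits | least_used_gpu)
    echo "least-used GPU: $gpu"
    # CUDA_DEVICE_ORDER=PCI_BUS_ID makes CUDA number devices the way nvidia-smi does.
    # my_gpu_program is a placeholder; uncomment and substitute your own command:
    # CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES="$gpu" my_gpu_program
fi
```

Memory in use is only a rough proxy for availability, so it is still worth checking the process list in nvidia-smi before starting a long job.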