top

To check the CPU and memory usage of active jobs on a server, use the command 'top':
top

To see only the jobs of a specific user, call it with:
top -u username
Alternatively, you can press 'u' once 'top' is open. In the 6th line you will then see:
Which user (blank for all)
Then you can just start typing or copy-pasting the username.

By default, 'top' sorts all jobs by CPU usage. To sort them by memory usage, just type 'M' (capital 'm') once 'top' is open.

To quit 'top' just press 'q'.
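
If you want a one-off snapshot that you can pipe or save to a file, 'top' can also run non-interactively. The following is a minimal sketch, assuming the procps-ng version of 'top' found on most Linux systems (other implementations may not know the '-o' option). The function name top_snapshot is made up for this example.

```shell
# Print one non-interactive snapshot of a user's processes, sorted by memory.
# Assumes procps-ng top: -b = batch mode, -n 1 = one iteration,
# -o %MEM = sort by the %MEM column, -u = show only this user's processes.
top_snapshot() {
    user="${1:-$(id -un)}"   # default: the user running the command
    top -b -n 1 -o %MEM -u "$user"
}

# Usage: top_snapshot            (your own jobs)
#        top_snapshot username   (jobs of another user)
```

Piping the result through 'head -n 20' keeps only the summary lines and the largest processes.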

Output explanation

The column headings in the process list are as follows:

PID: Process ID.
USER: The owner of the process.
PR: Process priority.
NI: The nice value of the process.
VIRT: Amount of virtual memory used by the process. On our servers, the maximum virtual memory a job can currently use is 25% of the total memory, which means 64 GB on most of our servers.
RES: Amount of resident memory used by the process. This is the actual memory your process is using!!!
SHR: Amount of shared memory used by the process.
S: Status of the process. (See the list below for the values this field can take).
%CPU: The share of CPU time used by the process since the last update. For a single-threaded process this can go up to a little more than 100%. With shared memory parallelism (OpenMP) it can go up to 100% times the number of threads.
%MEM: The share of physical memory used.
TIME+: Total CPU time used by the task, shown as minutes:seconds.hundredths.
COMMAND: The command name or command line (name + options).

In theory, all users together, including the system, can use almost 100% of the total memory before things start to get really slow. But since one user usually does not know what all the others are doing, we ask each user not to use more than 25% of the total memory for all of her/his processes together.
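
To see how close you are to the 25% limit, you can add up the resident memory (RES) of all your processes and compare it with the total memory of the server. The following is a rough sketch, assuming a Linux server with /proc/meminfo and a 'ps' that supports '-o rss='. The helper name rss_percent is invented for this example. Since shared memory is counted once per process, the result tends to overestimate your real usage.

```shell
# Read resident memory values in KiB (one per line) on stdin and print
# their sum as a percentage of the total given as first argument (in KiB).
rss_percent() {
    awk -v total="$1" '{ sum += $1 } END { printf "%.1f\n", 100 * sum / total }'
}

# Usage (Linux): share of physical memory used by all your processes
#   total_kib=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
#   ps -u "$USER" -o rss= | rss_percent "$total_kib"
```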

For as long as the CPU time (column: 'TIME+') keeps increasing, you do not have to worry about jobs with the status (column: 'S') of 'R', 'S' or 'D'. But if the CPU time stops increasing for a while, you should check if this job is still needed or if you can terminate it - especially if it uses several % of memory.

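Jobs with a status of 'T' or 'Z' should always get killed. Note that a 'Z' (zombie) process has already exited and cannot be killed itself: it disappears once its parent process collects it or is terminated, so in that case kill the parent.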

If you see processes of your own that you recognize(!) and that should not be there anymore, they are probably "zombies" left over from an earlier session and you should kill them.
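
To find jobs with a status of 'T' or 'Z' without scrolling through 'top', you can filter the output of 'ps'. This is only a sketch: it assumes a 'ps' that accepts '-o pid=,ppid=,stat=,comm=', and the function name stopped_or_zombie is made up for this example.

```shell
# Keep only the lines whose third column (process state) starts with T or Z.
# Expects "PID PPID STAT COMMAND" lines on stdin.
stopped_or_zombie() {
    awk '$3 ~ /^[TZ]/'
}

# Usage: ps -u "$USER" -o pid=,ppid=,stat=,comm= | stopped_or_zombie
```

The PPID column in the result tells you which parent process to look at for a zombie.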

The following is from:
https://www.howtogeek.com/668986/how-to-use-the-linux-top-command-and-understand-its-output/

The first line of numbers on the dashboard includes the time, how long your computer has been running, the number of people logged in, and what the load average has been for the past one, five, and 15 minutes. The second line shows the number of tasks and their states: running, stopped, sleeping, or zombie.

The third line displays the following CPU values:

us: Amount of time the CPU spends executing processes for people in “user space.”
sy: Amount of time spent running system “kernel space” processes.
ni: Amount of time spent executing processes with a manually set nice value.
id: Amount of CPU idle time.
wa: Amount of time the CPU spends waiting for I/O to complete.
hi: Amount of time spent servicing hardware interrupts.
si: Amount of time spent servicing software interrupts.
st: Amount of time lost due to running virtual machines (“steal time”).

The fourth line shows the total amount of physical memory, and how much is free, used, and buffered or cached.

The fifth line shows the total amount (also in kibibytes) of swap memory, and how much is free, used, and available. The latter includes memory that’s expected to be recoverable from caches.

The status of the process can be one of the following:

D: Uninterruptible sleep
R: Running
S: Sleeping
T: Traced (stopped)
Z: Zombie

ps & kill

You can check which processes you have open with:

ps -fu $USER | less

If you cannot find where you opened a process in order to close it down properly, you can kill it with 'kill'. You only need to kill the master process. For example, if you get something like the following:

                  parent
UID          PID    PPID C STIME TTY          TIME CMD
username 1460822       1 0 May23 ?        00:01:48 tmux
username  945179 1460822 0 May25 pts/20   00:00:00 -bash
username  945272  945179 0 May25 pts/20   00:37:36 /sca/.../jupyter-notebook --no-browser
username  969070  945272 0 May25 ?        00:10:16 /sca/.../python -m ipykernel_launcher -f /.../kernel-...json
username  987828  945272 0 May25 ?        00:09:37 /sca/.../python -m ipykernel_launcher -f /.../kernel-...json

In the example above, the process with the PID (process ID) '1460822' is the master process. It does not have a "parent" among your processes: its PPID (parent process ID) is 1. This is the one you need to kill. All its "children", "grandchildren", "great-grandchildren" etc. might(!) then get killed as well. If they do not get killed, you will have to kill them with 'kill -9 ..' as well.
Note that the processes are not always listed in parent/child order!

Sometimes processes do not have a parent anymore. In that case you need to kill each of them using its own PID.
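
Because the 'ps' output is not always in order, it can be hard to see which processes belong to one master process. With the procps version of 'ps' (the usual one on Linux), 'ps -fu $USER --forest' draws the parent/child tree. Alternatively, the following sketch lists all descendants of one PID. The function name descendants is made up for this example, and it assumes a 'ps' that accepts '-o pid=,ppid='.

```shell
# Print the PIDs of all descendants of the PID given as first argument.
# Expects "PID PPID" lines on stdin; the order of the lines does not matter.
descendants() {
    awk -v root="$1" '
        { parent[$1] = $2 }
        END {
            for (pid in parent) {
                p = pid
                # Walk up the chain of parents until we reach root
                # or a process that is not in the list.
                while ((p in parent) && p != root) p = parent[p]
                if (p == root && pid != root) print pid
            }
        }' | sort -n
}

# Usage: ps -u "$USER" -o pid=,ppid= | descendants 1460822
```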

The command to kill a process is:

kill -9 PID

So, for the example above:
kill -9 1460822
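
Note that 'kill -9' gives a process no chance to clean up (for example to save its state or remove temporary files). A gentler variant, shown here only as a sketch, is to send the default TERM signal first and to fall back to 'kill -9' if the process is still there after a few seconds. The function name stop_process and the default of 5 seconds are assumptions of this example, not a rule on our servers.

```shell
# Ask a process to exit with SIGTERM, wait up to $2 seconds (default 5),
# and send SIGKILL (kill -9) only if it is still there afterwards.
stop_process() {
    pid="$1"
    wait_s="${2:-5}"
    # Plain "kill" sends SIGTERM; it fails if there is no such process
    # or if you are not allowed to signal it.
    kill "$pid" 2>/dev/null || return 1
    # "kill -0" sends no signal, it only checks that the process exists.
    while [ "$wait_s" -gt 0 ] && kill -0 "$pid" 2>/dev/null; do
        sleep 1
        wait_s=$((wait_s - 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid"
    fi
}

# Usage: stop_process 1460822       (wait up to 5 seconds)
#        stop_process 1460822 30    (wait up to 30 seconds)
```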
