When you type commands in a login shell (window/terminal) and see a response displayed, you are working interactively. This is the most common way of working. The main downsides are that you cannot disconnect while a process is running (it would get terminated) and that if many users run many jobs interactively the computer can get "overworked", slowing down all processes. Therefore, it is sometimes more practical to send jobs to the background or to submit batch jobs.

Background processes

Send a job/process to the background

Processes that open their own window, like emacs, Matlab, xrec, xxdiff and others, block further usage of the terminal (window) from which they were opened. To be able to continue using the terminal for other things, you can send such processes to the background. You can do this right away by adding a '&' at the end of the command. For example:

emacs filename &
matlab &


If you forgot to add the '&' but would still like to continue using the terminal, you can send the already running process to the background with:

    Ctrl-Z
followed by
    bg
(for background)
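
To check that the process is now indeed running in the background you can use the shell builtin 'jobs':

    jobs

In bash this prints one line per background job of the current shell, something like '[1]+  Running    emacs filename &'.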


It is also possible to send commands like rsync to the background, redirecting their output to a file:

    rsync [keys] source destination  > logfile 2>&1 &

The logfile will contain the output of rsync which usually appears on the screen.
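
While the transfer is running you can, for example, follow its progress by looking at the end of the logfile:

    tail -f logfile

(Quit 'tail -f' with Ctrl-C; this only stops 'tail', not the rsync running in the background.)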

Check running processes

Once a process which did not open its own window is running in the background, you cannot see it anymore in the terminal from which you started it. To see processes running in the background (as well as all other processes) you can use the command 'ps'. For example:

    ps -fu username
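
Among other things, 'ps' shows the process ID (PID) of each process; you will need the PID if you want to kill one of them (see below). The output looks something like this (illustrative only):

    UID        PID  PPID  C STIME TTY          TIME CMD
    username 12345  2345  0 10:02 pts/3    00:00:07 rsync ...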

To get more information about 'ps', execute the command: man ps

Another way to see all running processes is with 'top'. For example:

    top -u username

Kill a background process

Once a process is running in the background you cannot terminate it anymore with Ctrl-C or Ctrl-D. If the process you sent to the background has its own window you can close the window, which usually also kills the process. A background process without a window is no longer visible in the terminal from which you started it, but you can find it with 'ps' or 'top' - see above.
Once you have found the process you want to terminate, you can kill it using its process ID (PID):

    kill -9 PID
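
For example, assuming 'ps' showed the rsync from above with the (illustrative) PID 12345:

    ps -fu username | grep rsync
    kill -9 12345

A plain 'kill PID' (without '-9') sends the gentler SIGTERM signal, which gives the process a chance to clean up; '-9' (SIGKILL) cannot be ignored by the process and is best kept as a last resort.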


Batch processes

There is also the possibility of submitting your job in batch.

Batch processing is generally used to manage the resources (cores and memory) when more is requested than is available. But it also allows users to disconnect after having submitted their job(s).

If you have a script or program that needs a lot of time or cores and requires no user interaction, it is best to submit your job in batch. While on our UQAM servers you only have to submit jobs in batch that need more than 1 core, on clusters of The Alliance you have to submit all jobs that take longer than a few minutes or use more than 1 core in batch!

Scheduler

Submitting a job in batch means it gets sent to a "scheduler" which will handle all submitted jobs.
A scheduler manages a certain number of cores and a certain amount of memory. The cores can be organized in different types of partitions (classes/queues) with different characteristics, which determine for how long a job can run, whether or not a job uses full nodes, what type of cores it needs and how much memory a job needs. There are no such partitions at UQAM but there are on clusters of The Alliance. Click on the following link for more information: The Alliance job scheduling policies

When submitting a job in batch one needs to specify the resources it needs, like number of cores (MPI and OpenMP), runtime, memory, etc.

According to these specifications the scheduler will organize the submitted jobs and determine the priority with which the jobs of the different users will get executed. At UQAM we do not have many submitted jobs, so they usually start running immediately; on clusters of The Alliance, however, jobs can be queued for quite a while.

Submit a job

Only shell scripts can get submitted to the scheduler! If you want to submit anything else you have to write a little shell script that executes your script/program the same way you usually run it interactively. In this script you have to load the needed modules etc., possibly change into a certain directory (the submitted job does not know from which directory it got submitted), and then execute the script/program you want to run, as in the sketch below.
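
For example, a minimal wrapper script could look like this (the module name, the directory and the program name are placeholders you need to adapt):

    #!/bin/bash
    # Load the modules your program needs (placeholder module)
    module load python/3.10
    # A batch job does not start in the directory it was submitted from,
    # so change into your working directory explicitly (placeholder path)
    cd ~/my_experiment
    # Run your script/program exactly as you would interactively
    ./my_program input.dat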

The way to submit, check, and kill a job depends on the scheduling system and on the way it is installed. If you want to use The Alliance way of submitting jobs on their clusters check out the following link: Submit a job on clusters of The Alliance
However, to make our life easier, the RPN environment contains a set of tools that will do all "adjustments" for you and that can always be used the same way on all machines on which the RPN environment is installed.

Soumet

'soumet' is the tool to submit jobs in the RPN environment. To submit any job on our UQAM servers, always use the command 'soumet'. On clusters of The Alliance it is up to you whether you want to use 'soumet' or not, provided you are using the RPN environment. If you do not, you have to use their way of submitting jobs (see the link above).

The advantage of using 'soumet' is that you do not have to add any scheduler directives at the top of your script. 'soumet' will do this for you according to the arguments you give it.

The command to submit a job/script with soumet could look like this:

    soumet jobname [ -t time_in_seconds  -listing listing_directory  -jn listing_name  -cpus number_of_cpus  -mpi ]

Where:

  • 'jobname' is the name of the shell script (also called the job) you want to submit.
  • '-t  time_in_seconds' specifies the wallclock time [in seconds] you request for the job. ("wallclock time" or "walltime" comes from "clock on the wall time" and is the "real" time that passes, not, for example, the cpu or system time.) If the job has not finished when this time expires, it will get terminated. So it is better to always ask for enough time. However, on larger clusters, like those of The Alliance, jobs asking for more time will be queued longer.
    On our UQAM systems the default wallclock time for single-CPU/core jobs is 10 days. For multi-core jobs the default time is 1 minute.
    When running on clusters of The Alliance check out their wiki : Time limits on clusters of The Alliance
  • '-jn  listing_name'  specifies the name of the listing or log file of the job. Everything that would appear on the screen when running the job interactively will now get written into this file. The default listing name is the basename of the job.
  • '-listing  listing_directory'  specifies the directory in which the listing will get written. The default directory is:
           ~/listings/${TRUE_HOST}
    If you want to use the default listings directory, you should first create the directory ~/listings and then create a symbolic link inside that directory, pointing to a place where you have more space, for example to a directory (that you have to create!) under your data space:
            mkdir -p /dataspace/Listings
            mkdir ~/listings
            ln -s /dataspace/Listings ~/listings/${TRUE_HOST}
    Replace 'dataspace' with the full name of your data directory.
  • '-cpus  number_of_cpus'  specifies the number of cpus you want to use when running the job in parallel using MPI and/or OpenMP. The syntax is the following:  MPIxOpenMP
    For example, if you want to use 4 MPI processes and 2 OpenMP threads you would write:  -cpus 4x2
    If you want to use pure MPI with 4 MPI processes you would write:  -cpus 4x1  or simply  -cpus 4
    If you want to use pure OpenMP with 2 threads you would write:  -cpus 1x2
    The default is 1x1.
  • '-mpi'  needs to get added when running a script (or the executable that will get executed within the script) with MPI.


To get more information about the command simply execute:

    soumet -h
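
Putting it all together, a complete call could look like this (the script name and the requested resources are just an example):

    soumet run_model.sh -t 7200 -cpus 4x2 -mpi -jn run_model

This would submit the script 'run_model.sh' as an MPI job with 4 MPI processes and 2 OpenMP threads each, asking for 2 hours (7200 seconds) of wallclock time and writing the listing to a file starting with 'run_model' in the default listings directory.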

Check on jobs

As mentioned above, the way to check on (and kill) jobs also depends on the queueing system. Therefore, we created a script called 'qs' which you can find in my ovbin:

    ~winger/ovbin/qs

On our UQAM servers as well as on clusters of The Alliance (when using the RPN environment) you already have an alias pointing to the above command, called:

    qs

The column 'ST' or 'state' shows the status of the job:
    PD : pending
    R  : running
    C  : cancelled

The column 'TMAX' or 'wallclock' shows the wall clock time, i.e. the run time the job requested. Once this time runs out the job will get terminated.

    qsa

Only available on clusters of the Alliance.
If you would like to know what the other users of the ESCER Center are running you can use the command 'qsa' ('a' for all). I maintain the list of all users by hand, so if I forgot someone, let me (Katja) know.

    qsp

Only available on clusters of the Alliance.
The command 'qsp' ('p' for project) shows you all jobs currently in all the project allocations to which you can submit jobs. The first section contains a detailed list of all jobs for each project account and the last section shows you the current "usage". Here you can see the "workload" of the projects. The LevelFS is a measure of how much a project was used in the recent past and therefore how high its priority is. The higher the LevelFS, the faster a job will start running.

If a def-account was not used during the last few days the total written will be '0/1', or, if someone just submitted a job, 'n/1', and the LevelFS will be strongly distorted. In that case you cannot trust the LevelFS at all. But the RawUsage will give you an indication of how much the account was used in the recent past. The smaller the RawUsage, the less it was used.

Kill a job

Check the job-ID number with 'qs' or '~winger/ovbin/qs' and then use 'qdel' to kill your job:

    qdel job-ID

Depending on the machine, this might be an alias again.

