...

There is also the possibility of submitting your job in batch.

Batch processing is generally used to manage resources (cores and memory) when requests exceed the available resources. It also allows users to disconnect after having submitted their job(s).

If you have a script or program that needs a lot of time or cores and requires no user interaction, it is best to submit it in batch. While on our UQAM servers you only have to submit jobs in batch that need more than 1 core, on clusters of The Alliance you have to submit all jobs that take longer than a few minutes or use more than 1 core in batch!

...

Submitting a job in batch means it gets sent to a "scheduler" which will handle all submitted jobs.
A scheduler manages a certain amount of cores and memory. The cores can be organized in different types of partitions (classes/queues) with different characteristics, which determine how long a job can run, whether a job uses full nodes or not, what type of cores it needs and how much memory it can use. There are no such partitions at UQAM but there are on clusters of The Alliance. Click on the following link for more information: The Alliance job scheduling policies

When submitting a job in batch one needs to specify the resources it requires, like the number of cores (MPI and OpenMP), runtime, memory, etc.

Based on these specifications the scheduler will organize the submitted jobs and determine the priority in which the jobs of different users get executed. While at UQAM there are usually few submitted jobs, so they tend to start running immediately, jobs can be queued for quite a while on clusters of The Alliance.

Submit a job

Only Shell scripts can be submitted to the scheduler! If you want to submit anything else you have to write a little Shell script that executes your script/program the same way you usually run it interactively. In this script you would have to load the needed modules, possibly change into a certain directory (the submitted job does not know from which directory it got submitted) and then execute the script/program you want to run.
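
For example, a minimal wrapper script could look like the following sketch; the module name, directory and program name are only placeholders you would replace with your own:

        #!/bin/bash
        # Sketch of a wrapper script - module, directory and program names are placeholders

        # Load the modules your program needs
        module load python3

        # Change into the directory you normally run from
        # (the batch job does not start in the directory it was submitted from)
        cd ~/my_project

        # Run your script/program exactly as you would interactively
        ./my_program.py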

...

  • 'jobname' is the name of the shell script (also called the job) you want to submit.
  • '-t  time_in_seconds' specifies the wallclock time [in seconds] you request for the job. ("wallclock time" or "walltime" refers to the time on the "clock on the wall", i.e. the "real" time that passes and not, for example, the cpu or system time.) Even if the job is not finished when this time expires, the job will get terminated. So it is better to always ask for enough time. However, on larger clusters like those of The Alliance, jobs asking for more time will be queued longer.
    On our UQAM systems the default wallclock time for single CPU/core jobs is 10 days. For multi-core jobs the default time is 1 minute.
    When running on clusters of The Alliance check out their wiki : Time limits on clusters of The Alliance
  • '-jn  listing_name'  specifies the name of the listing or log file of the job. Everything that would appear on the screen when running the job interactively will now get written into this file. It can also be used to check if a job is still running, for example with 'tail -f listing_name'. The default listing name is the basename of the job.
  • '-listing  listing_directory'  specifies the directory in which the listing will be written. The default directory is:
           ~/listings/${TRUE_HOST}
    If you want to use the default listings directory, you should first create the directory ~/listings and then create a symbolic link inside that directory, pointing to a place where you have more space, for example to a directory (that you have to create!) under your data space:
            mkdir -p /dataspace/Listings
            mkdir ~/listings
            ln -s /dataspace/Listings ~/listings/${TRUE_HOST}
    Replace 'dataspace' with the full name of your data directory.
  • '-cpus  number_of_cpus'  specifies the number of cpus you want to use when running the job in parallel using MPI and/or OpenMP. The syntax is the following:  MPIxOpenMP
    For example, if you want to use 4 MPI processes and 2 OpenMP threads (per MPI process) you would write:  -cpus 4x2
    If you want to use pure MPI with MPI=4 processes you would write:  -cpus 4x1  or simply  -cpus 4
    If you want to use pure OpenMP with 2 OpenMP threads you would write:  -cpus 1x2
    The default is 1x1.
  • '-mpi'  needs to be added when the script (or the executable that will get executed within the script) is run with MPI. A combined example using several of these options follows this list.
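
Putting several of these options together, a submission could look like the following sketch. Here 'submit_command' stands for the submission command introduced above, and the job name and option values are only placeholder examples:

        # Hypothetical example: submit 'my_job.sh' with 5 hours (18000 s) of walltime,
        # 4 MPI processes x 2 OpenMP threads, MPI enabled and a custom listing name
        submit_command my_job.sh -t 18000 -cpus 4x2 -mpi -jn my_job_listing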

...