How to find where/why a simulation crashed/stopped
...
Find the last listing
If you started your simulation with 'Chunk_lance' you have two ways to check how far your simulation has advanced. If you are using 'Um_lance' you can only use the second option.
1) Check file 'chunk_job.log' - only exists when submitting with 'Chunk_lance'
In your config file directory, have a look at the last lines of your file 'chunk_job.log'. I usually list them with:
tail chunk_job.log
Go into your listings directory (~/listings/${TRUE_HOST}). For example on Beluga/Narval:
cd ~/listings/Beluga
resp.
cd ~/listings/Narval
You can also do that by executing the following alias - if existing:
lis
List all the listings of the month that failed chronologically.
If the last line in chunk_job.log contains '... started ...', search for the last listing whose name starts(!) with the job name written in that line. For example with:
ls -lrt model_job_M*
If the last line in chunk_job.log contains '... finished ...', search for the last 'cjob_*' or 'pjob_*' listing. For example with:
ls -lrt ?job_experiment_name_*
Where 'experiment_name' is the base name of your simulation (without the YYYYMM at the end!).
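If you want to script this step, something like the following finds the newest listing for the job named in the last line of chunk_job.log. This is only a sketch: it assumes the job name is the last word of that line, so adjust the awk field to your chunk_job.log format.
job_name=$(tail -1 chunk_job.log | awk '{print $NF}')   # assumption: job name is the last word of the last line
ls -lrt ${job_name}* | tail -1                          # newest listing starting with that job name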
2) Check the listings
Go into your listings directory (~/listings/$TRUE_HOST). You can do that by executing the following alias - if existing:
lis
List all script and model listings of your simulation chronologically. For example with:
ls -lrt experiment_name_??????
ls -lrt ${GEM_exp}_[MS][_.]*
...
Where 'experiment_name' is the base name of your simulation (without the YYYYMM at the end!).
The last one will be the one in which the simulation failed.
Open the last listing in your editor or with 'less'
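If you prefer a one-liner, something like the following opens the newest matching listing directly with 'less'. It reuses the glob from the example above; adjust it if your listings are named differently.
less $(ls -rt ${GEM_exp}_[MS][_.]* | tail -1)   # open the most recent script/model listing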
If the model stopped in the ...
...
b) Model listing ${GEM_exp}_M*
Each model job consists of 3 main parts:
- It starts with a shell part,
- followed by the Fortran executable,
- followed by another shell part.
Below are a few different suggestions to find why the model crashed. I usually try them one after the other until I find the problem:
- Jump to the end of the listings (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G')
...
- and search upwards for '00000:'. This will bring you to the last line of the listing of the main process. From there, look upwards to see if you find anything out of the ordinary. The error might be several lines up!
- Search case-insensitively(!) for 'Traceback'.
- Search case-insensitively(!) for 'ABORT'. A grep-based alternative to these searches is sketched below.
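For this, a case-insensitive grep over the model listing shows all 'Traceback' and 'ABORT' lines at once. The file glob follows the naming from the heading of this section; adjust it to the listing you are actually inspecting.
grep -in -e 'traceback' -e 'abort' ${GEM_exp}_M*   # case-insensitive, with line numbers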
c) Chunk_lance listing cjob_* or pjob_*
If you submitted your simulation with 'Chunk_lance' you will also have listings starting with 'cjob_*' or 'pjob_*'. These listings include the calls to the scripts and the model described above.
- Jump to the end of the listings (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G')
- Search upwards until you find an error message
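If you only want a quick look at the end of the newest 'cjob_*'/'pjob_*' listing without opening an editor, something like this works (the glob is the same as in the 'Find the last listing' section; 'experiment_name' again is the base name of your simulation):
tail -n 50 $(ls -rt ?job_experiment_name_* | tail -1)   # last lines of the newest cjob_*/pjob_* listing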
Common error messages and how to interpret them
First shell part
Every model listing starts with this shell part.
Epoll ADD, ORTE, MPI launch failed
...
Common error messages in this part that point to a problem with the machine are:
...
...
Fortran executable
In the Fortran part, every MPI process (there is one MPI process per "tile") writes its own listing. Once the Fortran executable has finished running, MPI will collect the listings from all the processes and add them to the main model listing, ${GEM_exp}_M*. To be able to see which line was written by which process, all these lines are preceded by the number of the process, for example:
...
INFO: temporary listings for all members in directory_name
You can find the line above in your model listing!
In the directory 'directory_name' you have one directory per process, ?????, which contains the listing of said process.
...
If the model stopped in the Fortran executable, most of the time you can find an error message at the end of the listing of process 0. To get there, jump to the end of the listing and then search backwards for the end of the listing of the main process. (When using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G' and then search upward with '?00000:'.) But even from the end of the main process listing you might still have to look several lines up to find an error. However, once you reach a line saying:
THE TIME STEP n IS COMPLETED
there is probably no error above anymore. In that case you will have to look into the listings of all the other processes.
For a large grid you might have to use a lot of MPI processes and it is not easy to find an error message among them. Try looking for the word "Traceback". If you are lucky this takes you exactly to the traceback of the error.
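A way to narrow this down is to grep the temporary per-process listings for a traceback and then only open the files that are reported. 'directory_name' is the placeholder from the INFO line above; the exact layout of that directory is an assumption.
grep -ril 'traceback' directory_name   # list the per-process listings that contain a traceback
Then open the reported files with 'less'.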
If the executable started running but was not able to finish the first timestep, meaning if you do not have at least one line saying:
...
it is possible that there was a problem reading the restart files. Check in the listings of all processes if you find a message like:
oe-00000-...: forrtl: severe (24): end-of-file during read, unit 999, file .../gem_restart
If you find such a line you need to restart the simulation from the previous restart file.
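To check all processes at once you can grep for that message, either in the collected model listing or in the temporary per-process directory mentioned above (globs and paths are the ones used earlier on this page):
grep -n 'end-of-file during read' ${GEM_exp}_M*      # collected model listing
grep -rn 'end-of-file during read' directory_name    # or the temporary per-process listings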
If you find no error message at all, there might have been a problem with the machine and not all MPI processes could get started, or your restart files are corrupted in a way you cannot see. If this happens for the first time for a given month you can just restart the simulation. But if this happens more than once I would restart the simulation from the previous month's restart file, assuming there is a problem with the restart files.
...
If the model stops more than once at the same timestep have a look at the listings of ALL processes to see what went wrong.
If the model stops while writing the restart files you always have to restart the simulation from the previous month.
Meaning, if you have the following line in the listing:
oe-00000-00000: WRITING A RESTART FILE AT TIMESTEP # ...
But not the line saying:
oe-00000-00000: GEM_CTRL: END OF CURRENT TIME SLICE AT TIMESTEP ...
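A quick way to check this is to count both messages in the model listing (glob as above): if the first grep finds a match but the second one does not, the restart write did not complete and you have to go back to the previous month.
grep -c 'WRITING A RESTART FILE AT TIMESTEP' ${GEM_exp}_M*
grep -c 'END OF CURRENT TIME SLICE AT TIMESTEP' ${GEM_exp}_M*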
When the Fortran executable finishes fine, you will see the following messages at the end of the main process listing:
oe-00000-00000: Memory==> ...
:
oe-00000-00000: __________________TIMINGS ON PE #0_________________________
:
oe-00000-00000: .........RESTART
And then a big '****' box with an "END EXECUTION" inside.
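To verify quickly that the executable reached this normal end you can, for example, grep for it (glob as above):
grep -n 'END EXECUTION' ${GEM_exp}_M*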
Common error messages of the main model and their meaning:
Traceback
If you find a 'Traceback' you will hopefully also get some lines similar to the following:
oe-00000-00000: Image              PC                Routine            Line     Source
oe-00000-00000: maingemdm          0000000002287C5B  Unknown            Unknown  Unknown
oe-00000-00000: maingemdm          0000000000EB8391  fpe_handler_       52       ifort_fpe_handler.F90
oe-00000-00000: maingemdm          0000000002272607  Unknown            Unknown  Unknown
oe-00000-00000: libpthread-2.30.s  000014B7BFE2C0F0  Unknown            Unknown  Unknown
oe-00000-00000: maingemdm          0000000000AE51E3  lightning_lpi_     80       lightning_lpi.F90
oe-00000-00000: maingemdm          00000000009169A4  calcdiag_mp_calcd  546      calcdiag.F90
oe-00000-00000: maingemdm          0000000000800DA8  phyexe_            153      phyexe.F90
:
oe-00000-00000: maingemdm          00000000004956EE  gem_ctrl_          42       gem_ctrl.F90
oe-00000-00000: maingemdm          000000000041649C  gemdm_             55       gemdm.F90
oe-00000-00000: maingemdm          0000000000416370  MAIN__             2        maingemdm.F90
Starting from the top, look for the first source file that is part of the CRCM-GEM source code. In the example above, the simulation crashed (most probably) in line '80' of the source code file 'lightning_lpi.F90'.
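To pull out just the traceback lines that point to Fortran source files you can, for example, grep for the '.F90' suffix; note that this may also match unrelated lines of the listing.
grep -n '\.F90' ${GEM_exp}_M* | head -20   # first traceback lines referencing Fortran sources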
Dimensions differ from previous specification
oe-00000-00000: size(pp, 1 )= 71280 high= 73062 low= 1
oe-00000-00000: ERROR: gmm_create, requested dimensions differ from previous specification (res
oe-00000-00000: tart/create)
oe-00000-00000: ERROR: gmm_create, variable name ="XTH "
=> Possible reason: MPI-tiles too small
Bad canopy iteration temperature
oe-00000-00071: 0BAD CANOPY ITERATION TEMPERATURE      4    51   373.24     6     1
oe-00000-00071:   5301.31   384.41   315.63  1100.35     0.00  4224.56   234.27    13.73     0.00
oe-00000-00071:    373.24   281.25   273.15
oe-00000-00071: 0******** END TSOLVC ************************************************************************ -2
Crash in aprep.f
Crash in line with a division by 'THPOR'
If your simulation is crashing in a line with a division by 'THPOR' make sure the number of SAND and CLAY levels you set to be read in your 'physics_input_table' corresponds to the actual number of levels in your geophysical fields.
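To see which SAND and CLAY levels are requested, you can for example grep your 'physics_input_table'; the exact spelling of the entries depends on your setup, so treat this only as a starting point.
grep -in -e 'sand' -e 'clay' physics_input_table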
Crash after restart
If the job was restarted from the restart file, check the following:
1) Check if the permanent bus is still the same as before. To do that you can compare the current listing (the one with the crash) with the previous one (which should be archived in ${CLIMAT_archdir}/Listings/listings_....zip) using 'xxdiff' (see the sketch after this list). On Narval you will have to run 'module add difftools' to get access to xxdiff. The permanent bus will change if the executable was changed in a way that fields got added to or removed from the permanent bus.
2) Did you add any fields to outcfg.out? There are certain output fields that cannot be added to outcfg.out once a simulation has started. Or, if these fields were present from the start, no other fields can be added. However, one can replace one output field with another. Below is a list of these special fields:
Group 1: 'clse', 'cec ', 'cecm', 'ced ', 'cedm', 'cep ', 'cepm', 'cem ', 'cemm', 'cer ', 'cerm', 'ces ', 'cesm', 'cqt ', 'cqc ', 'cqcm', 'cqd ', 'cqdm', 'cqp ', 'cqpm', 'cqm ', 'cqmm', 'cqr ', 'cqrm', 'cqs ', 'cqsm', 'cey ', 'ceym', 'cef ', 'cefm', 'cqy ', 'cqym', 'cqf ', 'cqfm'
Group 2: 'fdac', 'fdre'
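For point 1), a minimal sketch of the comparison could look like the following; the two file names are placeholders for the previous listing (unzipped from the archive mentioned above) and the current one.
module add difftools                      # on Narval, to get access to xxdiff
xxdiff previous_listing current_listing   # look for differences in the permanent bus section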
ERROR: key object has dimensions smaller than value assigned - itf_phy/PHYOUT
Same as point 2) above!
...
Second shell part
The second shell part starts with the lines:
...
If the Fortran executable finished well - see section above - you can ignore this "FAILED" message.
Chunk_job listing (cjob_* or pjob_*)
If you cannot find any error message in the model listing, check the listing ending in '*.s'. If you submitted the simulation with Chunk_lance, have a look at the listing 'cjob_*.s'. When all went "well", this listing will be empty. But sometimes you can find messages in these files like the following:
Node failure
slurmstepd: error: *** JOB 17891032 ON nc20539 CANCELLED AT 2023-06-14T04:50:44 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
=> Obviously a problem with a node. Just resubmit (continue) your simulation.
Time limit exceeded
slurmstepd: error: *** JOB 13690472 ON nc30342 CANCELLED AT 2023-02-12T00:50:04 DUE TO TIME LIMIT ***
=> Your job ran out of time. If your jobs usually fit in the wall time you asked for, this might be due to slow access to the filesystems. In this case you can wait until the filesystem problems have been solved or just resubmit and hope for the best. You can also ask for more walltime (BACKEND_time_mod) or run fewer days per job (Fcst_rstrt_S).
If you just started your simulation, you should ask for more walltime (BACKEND_time_mod) and/or run fewer days per job (Fcst_rstrt_S).
Out of memory
slurmstepd: error: Detected 3 oom-kill event(s) in StepId=13861528.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
=> Your job ran out of memory. Ask for more MPI tiles (GEM_ptopo). You could also ask for more memory (BACKEND_cm) but this usually means that your jobs will be queued for much longer.