How to find where/why a simulation crashed/stopped

1) Check if the model stopped in the scripts or model job

Go into your listings directory:

    cd ~/listings/${TRUE_HOST}

For example, on Beluga:

    cd ~/listings/Beluga

or on Narval:

    cd ~/listings/Narval

List all the listings of the month that failed:

    ls -lrt ${GEM_exp}_[MS]*

Open the last listing (with 'ls -lrt' this is the most recent one) in your editor or with 'less'.
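
As a minimal sketch, the most recent listing can also be picked automatically instead of reading it off the 'ls' output. The function name latest_listing is hypothetical and not part of the model scripts; it only assumes the ${GEM_exp}_[MS]* naming used above.

```shell
# Hypothetical helper: print the newest scripts or model listing of an experiment.
# Usage: latest_listing EXPERIMENT_NAME [DIRECTORY]
latest_listing() {
    exp="$1"
    dir="${2:-.}"
    # 'ls -t' sorts newest first; 'head -n 1' keeps only the most recent file.
    ls -t "$dir"/"$exp"_[MS]* 2>/dev/null | head -n 1
}
```

From inside the listings directory it could be used as: less "$(latest_listing "${GEM_exp}")"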

If the model stopped in the ...

a) Scripts listing ${GEM_exp}_S*

Jump to the end of the listings (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G')
Search upwards until you find an error message
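
Instead of searching upwards by hand, the last suspicious lines can be listed with 'grep'. This is only a sketch: the function name last_errors is hypothetical, and the search words (error, abort, fail) are an assumption about what error messages typically contain, so a real error may be phrased differently.

```shell
# Hypothetical helper: show the last few lines of a listing that look like errors.
# Usage: last_errors LISTING [COUNT]
last_errors() {
    # -n prints line numbers, -i ignores case, -E enables the 'a|b|c' alternation.
    # The search words are an assumption; adapt them as needed.
    grep -n -i -E 'error|abort|fail' "$1" | tail -n "${2:-5}"
}
```

The printed line numbers can then be used to jump to the right place, for example with '42G' in 'vi', 'vim' or 'less'.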

b) Model listing ${GEM_exp}_M*

Jump to the end of the listings (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G')

Each model job consists of 3 main parts:

It starts with a shell part,
followed by the Fortran executable,
followed by another shell part.

1) If all goes well, the first shell part ends with:

   :
INFO: mpirun ...      veeeery long line!!!
INFO: MPI launch after n second(s)
INFO: START of listing processing : time & date
==============       start of parallel run       ==============

Common error messages in this part are:

[warn] Epoll ADD(...) on fd n failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
ORTE has lost communication with a remote daemon.
An ORTE daemon has unexpectedly failed after launch and before ...
An internal error has occurred in ORTE:
ERROR: MPI launch failed after n seconds, killing process m

If you find any of the above error messages in your model listing, there was most likely a problem with the machine, and you can simply restart the simulation.
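
The check for these messages can be sketched as a small shell function. The name has_launch_error is hypothetical; the search strings are the fixed parts of the error messages listed above, matched literally.

```shell
# Hypothetical helper: succeed if a model listing contains one of the
# launch error messages listed above.
# Usage: has_launch_error LISTING
has_launch_error() {
    # -q: no output, only the exit status; -F: match the strings literally.
    grep -q -F \
        -e 'Epoll ADD(' \
        -e 'ORTE has lost communication with a remote daemon' \
        -e 'An ORTE daemon has unexpectedly failed after launch' \
        -e 'An internal error has occurred in ORTE' \
        -e 'ERROR: MPI launch failed after' \
        "$1"
}
```

For example: if has_launch_error "$listing"; then echo "probably a machine problem, restart"; fi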

2) Fortran part

In the Fortran part, every MPI process (there is one MPI process per "tile") writes its own listing. Once the executable (Fortran part) has finished running, MPI collects the listings from all the processes and adds them to the main model listing, ${GEM_exp}_M*. To show which process wrote which line, every line is prefixed with the number of the process, for example:

    oe-00000-00000:
    oe-00000-00001:
    oe-00000-00002:
      :
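
Because of this prefix, the output of a single process can be pulled out of the model listing with 'grep'. A minimal sketch, in which the function name process_lines is hypothetical and the prefix is passed in exactly as it appears in the listing:

```shell
# Hypothetical helper: print only the lines written by one MPI process.
# Usage: process_lines LISTING PREFIX      (e.g. PREFIX = oe-00000-00002)
process_lines() {
    # -F matches the prefix literally; the trailing ':' avoids partial matches.
    grep -F -- "$2:" "$1"
}
```

For example: process_lines "$listing" oe-00000-00002 | less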

Sometimes the model crashes so badly that MPI is not able to gather the listings from all of the processes. If this happens, you can find the listings of the individual processes in the directory named in the following line of the model listing:

INFO: temporary listings for all members in directory_name
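
The directory name can be extracted from that line with 'sed'. This is a sketch only: the function name temp_listing_dir is hypothetical, and it assumes the directory name is everything that follows the fixed text on that line.

```shell
# Hypothetical helper: print the directory holding the per-process listings.
# Usage: temp_listing_dir LISTING
temp_listing_dir() {
    # Remove everything up to and including the fixed text and print what is left.
    # 'tail -n 1' keeps the last occurrence if the line appears more than once.
    sed -n 's/^.*INFO: temporary listings for all members in //p' "$1" | tail -n 1
}
```

For example: ls "$(temp_listing_dir "$listing")"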