How to find where/why a simulation crashed/stopped

Find the last listing

If you started your simulation with 'Chunk_lance' you have two options to check how far your simulation has advanced. If you are using 'Um_lance' you can only use the second option.

1) Check the file 'chunk_job.log' - it only exists when submitting with 'Chunk_lance'

In your config file directory, have a look at the last lines of your file 'chunk_job.log'. I usually list them with:

     tail chunk_job.log

Go into your listings directory (~/listings/$TRUE_HOST). You can do that by executing the following alias - if it exists:

    lis
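
If the 'lis' alias is not defined, a plain 'cd' does the same thing (a minimal sketch, assuming $TRUE_HOST is set in your environment):

    cd ~/listings/$TRUE_HOST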

If the last line of chunk_job.log says '... started ...', search for the last listing whose name starts(!) with the job name given in that line. For example with:

    ls  -lrt  model_job_M*

If the last line of chunk_job.log says '... finished ...', search for the last 'cjob_*' (or 'pjob_*') listing. For example with:

    ls  -lrt  ?job_experiment_name_*

Where 'experiment_name' is the base name of your simulation (without the YYYYMM at the end!).

2) Check the listings

Go into your listings directory (~/listings/$TRUE_HOST). You can do that by executing the following alias - if it exists:

    lis

List all script and model listings of your simulation chronologically. For example with:

    ls  -lrt experiment_name_??????_[MS][_.]*

Where 'experiment_name' is the base name of your simulation (without the YYYYMM at the end!).

The last one will be the one in which the simulation failed.

Open the last listing in your editor or with 'less'
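
If you are not sure which listing is the newest, you can also open it directly from the command line (a sketch; the glob is the same as above and 'experiment_name' is a placeholder):

    less $(ls -rt experiment_name_??????_[MS][_.]* | tail -1)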

If the model stopped in the ...

a) Scripts listing ${GEM_exp}_S*

  • Jump to the end of the listings (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G')
  • Search upwards until you find an error message

b) Model listing ${GEM_exp}_M*

Each model job consists of 3 main parts:

  • It starts with a shell code,
  • followed by the Fortran executable,
  • followed by another shell part.

Below are a few different suggestions to find why the model crashed. I usually try them one after the other until I find the problem:

  • Jump to the end of the listing (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G') and search upwards for '00000:'. This will bring you to the last line of the listing of the main process. From there, look upwards to see if anything is out of the ordinary. The error might be several lines up!
  • Search case-insensitively(!) for 'Traceback'.
  • Search case-insensitively(!) for 'ABORT'.
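
If you prefer searching from the command line, a case-insensitive 'grep' over the model listing does the same job (a sketch; the listing name is a placeholder):

    grep -in -E 'traceback|abort' experiment_name_YYYYMM_M_listing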

c) Chunk_lance listing cjob_* or pjob_*

If you submitted your simulation with 'Chunk_lance' you will also have listings starting with 'cjob_*' or 'pjob_*', respectively. These listings include the calls to the scripts and model above.

  • Jump to the end of the listings (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G')
  • Search upwards until you find an error message

Common error messages and how to interpret them

First shell part

Every model listing starts with this shell part.

Epoll ADD, ORTE, MPI launch failed

Common error messages in this part that point to a problem with the machine are:

  • [warn] Epoll ADD(...) on fd n failed.  Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
  • ORTE has lost communication with a remote daemon.
  • An ORTE daemon has unexpectedly failed after launch and before ...
  • An internal error has occurred in ORTE:
  • ERROR: MPI launch failed after n seconds, killing process m

If you find any of the above error messages in your model listing you can just restart the simulation.

If this part stops with other messages, and you do not understand what they mean, search for the message in your scripts under:

     ~/modeles/GEMDM/version/bin
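
A recursive 'grep' over that directory is usually the quickest way to find the script that printed the message (a sketch; replace 'version' and the search string accordingly):

    grep -rn 'part of the error message' ~/modeles/GEMDM/version/bin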

If all goes well, the first shell part ends with:

   :
INFO: mpirun ...      veeeery long line!!!

INFO: MPI launch after n second(s)
INFO: START of listing processing : time & date
==============       start of parallel run       ==============


Fortran executable

In the Fortran part, every MPI process (there is one MPI process per "tile") writes its own listing. Once the Fortran executable has finished running, MPI will collect the listings from all the processes and add them to the main model listing, ${GEM_exp}_M*. To show which line was written by which process, all these lines are preceded by the number of the process, for example:

    oe-00000-00000:
    oe-00000-00001:
    oe-00000-00002:
      :
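
To look at the output of one process on its own, you can filter its lines out of the main listing (a sketch; the listing name is a placeholder):

    grep 'oe-00000-00000:' experiment_name_YYYYMM_M_listing > pe_00000.txt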

Sometimes the model crashes so badly that MPI is not able to gather the listings from all of the processes. If this happens, the per-process listings are kept in the directory named in the following line, which you will find in your model listing:

    INFO: temporary listings for all members in directory_name

In the directory 'directory_name' there is one sub-directory per process, ?????, each containing the listing of that process.
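
To browse these per-process listings, something like the following works (a sketch; 'directory_name' is whatever the INFO line printed, and the 5-digit sub-directory names are an assumption based on the '?????' pattern):

    ls directory_name              # one sub-directory per process
    less directory_name/00000/*   # listing of process 0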


If the model stopped in the Fortran executable, most of the time you can find an error message at the end of the listing of process 0. To get there, jump to the end of the listing and then search backwards for the end of the listing of the main process. (When using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G' and then search upwards with '?00000:'.) Even from there you might still have to look several lines up to find an error. However, once you reach a line saying:
    THE TIME STEP  n IS COMPLETED
there is probably no error above anymore. In that case you will have to look into the listings of all the other processes.

For a large grid you might have to use a lot of MPI processes, and it is not easy to find an error message among them. Try looking for the word "Traceback". If you are lucky this takes you exactly to the traceback of the error.
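
Instead of opening every listing by hand, you can let 'grep' list the processes that mention a traceback (a sketch; 'directory_name' is the temporary listing directory mentioned above):

    grep -ril 'traceback' directory_name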


If the executable started running but was not able to finish the first timestep, meaning if you do not have at least one line saying:

    THE TIME STEP  n IS COMPLETED

it is possible that there was a problem reading the restart files. Check the listings of all processes for a message like:

   oe-00000-...: forrtl: severe (24): end-of-file during read, unit 999, file .../gem_restart
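
A quick way to check all processes at once is a 'grep' over the main listing and, if it exists, the temporary listing directory (a sketch; the names are placeholders):

    grep -n  'end-of-file during read' experiment_name_YYYYMM_M_listing
    grep -rl 'end-of-file during read' directory_name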

If you find such a line you need to restart the simulation from the previous restart file.
If you find no error message at all, there might have been a problem with the machine and not all MPI processes could get started, or your restart files might be corrupted in a way you cannot see. If this happens for the first time for a given month you can just restart the simulation. But if it happens more than once I would restart the simulation from the previous restart file, assuming there is a problem with the restart files.


If the executable stopped somewhere in the middle and you cannot find an error message but the last line of the listing of the main process says:

    OUT_DYN- WRITING ...
or
    OUT_PHY- WRITING ...

chances are the model got stuck while writing the output. In that case it might be enough to restart the simulation.
However, it might be safer to remove any output files of the previous attempt. You can find them under:

    ~/MODEL_EXEC_RUN/beluga/${GEM_exp}/RUNMOD/output/cfg_0000/*step*/???-???/*

In case you are not running full months make sure to only remove output from the job that stopped and not from any previous job!!!
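
A cautious way to do this is to list the files first and only delete them once you are sure the list contains output from the failed job only (a sketch, using the path above):

    ls  ~/MODEL_EXEC_RUN/beluga/${GEM_exp}/RUNMOD/output/cfg_0000/*step*/???-???/*
    rm  ~/MODEL_EXEC_RUN/beluga/${GEM_exp}/RUNMOD/output/cfg_0000/*step*/???-???/*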

But you can of course always check the listings of ALL processes.


If the model stops more than once at the same timestep have a look at the listings of ALL processes to see what went wrong.


If the model stops while writing the restart files you always have to restart the simulation from the previous month.
This is the case if the listing contains the line:

    oe-00000-00000: WRITING A RESTART FILE AT TIMESTEP # ...

But not the line saying:

    oe-00000-00000: GEM_CTRL: END OF CURRENT TIME SLICE AT TIMESTEP ...


When the Fortran executable finishes fine, you will see the following messages at the end of the main process listing:

    oe-00000-00000: Memory==> ...
      :  
    oe-00000-00000:  __________________TIMINGS ON PE #0_________________________
      :  
    oe-00000-00000:  .........RESTART
And then a big '****' box with an "END EXECUTION" inside.


Common error messages of the main model and their meaning:

Traceback

If you find a 'Traceback' you will hopefully also get some lines similar to the following:

oe-00000-00000: Image              PC                Routine            Line        Source
oe-00000-00000: maingemdm          0000000002287C5B  Unknown               Unknown  Unknown
oe-00000-00000: maingemdm          0000000000EB8391  fpe_handler_               52  ifort_fpe_handler.F90
oe-00000-00000: maingemdm          0000000002272607  Unknown               Unknown  Unknown
oe-00000-00000: libpthread-2.30.s  000014B7BFE2C0F0  Unknown               Unknown  Unknown
oe-00000-00000: maingemdm          0000000000AE51E3  lightning_lpi_             80  lightning_lpi.F90
oe-00000-00000: maingemdm          00000000009169A4  calcdiag_mp_calcd         546  calcdiag.F90
oe-00000-00000: maingemdm          0000000000800DA8  phyexe_                   153  phyexe.F90
  :
oe-00000-00000: maingemdm          00000000004956EE  gem_ctrl_                  42  gem_ctrl.F90
oe-00000-00000: maingemdm          000000000041649C  gemdm_                     55  gemdm.F90
oe-00000-00000: maingemdm          0000000000416370  MAIN__                      2  maingemdm.F90

 
Starting from the top, look for the first source file that is part of the CRCM-GEM source code. In the example above, the simulation (most probably) crashed in line '80' of the source file 'lightning_lpi.F90'.

Dimensions differ from previous specification

oe-00000-00000:  size(pp,           1 )=       71280  high=       73062  low=           1
oe-00000-00000:  ERROR: gmm_create, requested dimensions differ from previous specification (res
oe-00000-00000:  tart/create)
oe-00000-00000:  ERROR: gmm_create, variable name ="XTH                             "

=> Possible reason: MPI-tiles too small

Bad canopy iteration temperature

oe-00000-00071: 0BAD CANOPY ITERATION TEMPERATURE     4 51          373.24   6   1
oe-00000-00071:      5301.31    384.41    315.63   1100.35      0.00   4224.56    234.27     13.73      0.00
oe-00000-00071:       373.24    281.25    273.15
oe-00000-00071: 0********  END  TSOLVC  ************************************************************************      -2


Crash in aprep.f

Crash in line with a division by 'THPOR'

If your simulation is crashing in a line with a division by 'THPOR' make sure the number of SAND and CLAY levels you set to be read in your 'physics_input_table' corresponds to the actual number of levels in your geophysical fields.

Crash after restart

If the job was restarted from the restart file, check the following:

    1) Check if the permanent bus is still the same as before. To do that, compare the current listing (the one with the crash) with the previous one (which should be archived in ${CLIMAT_archdir}/Listings/listings_....zip) using 'xxdiff' - see the sketch after the field list below. On Narval you will have to load 'module add difftools' to get access to xxdiff. The permanent bus will change if the executable was changed in a way that fields got added to or removed from the permanent bus.

    2) Did you add any fields to outcfg.out? There are certain output fields that cannot be added to outcfg.out once a simulation has started. Or, if these fields were present from the start, no other fields can be added. However, one can replace one output field with another. Below is a list of these special fields:

Group 1: 'clse', 'cec ', 'cecm', 'ced ', 'cedm', 'cep ', 'cepm', 'cem ', 'cemm', 'cer ', 'cerm', 'ces ', 'cesm', 'cqt ', 'cqc ', 'cqcm', 'cqd ', 'cqdm', 'cqp ', 'cqpm', 'cqm ', 'cqmm', 'cqr ', 'cqrm', 'cqs ', 'cqsm', 'cey ', 'ceym', 'cef ', 'cefm', 'cqy ', 'cqym', 'cqf ', 'cqfm'

Group 2: 'fdac', 'fdre'
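
For point 1), the comparison itself can look like this (a sketch; the listing names are placeholders):

    module add difftools                      # needed on Narval to get access to xxdiff
    xxdiff previous_listing current_listing   # compare the permanent bus sections of the two listings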

ERROR: key object has dimensions smaller than value assigned - itf_phy/PHYOUT

Same as point 2) above!

Second shell part

The second shell part starts with the lines:

   :
==============       end of parallel run       ==============
INFO: END of listing processing : date & time

However, sometimes there can be an error message like:

==============       end of parallel run       ==============
INFO: END of listing processing : date & time
INFO: RUN FAILED
INFO: first 10 failing processes :
fail.00000-... fail.00000-...

If the Fortran executable finished well - see section above - you can ignore this "FAILED" message.

Chunk_job listing (cjob_* or pjob_*)

If you cannot find any error message in the model listing, check the listing ending in '*.s'. If you submitted the simulation with Chunk_lance, have a look at the listing 'cjob_*.s'. When all went "well", this listing will be empty. But sometimes you can find messages like the following in these files:

Node failure

slurmstepd: error: *** JOB 17891032 ON nc20539 CANCELLED AT 2023-06-14T04:50:44 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

=> Obviously a problem with a node. Just resubmit (continue) your simulation.

Time limit exceeded

slurmstepd: error: *** JOB 13690472 ON nc30342 CANCELLED AT 2023-02-12T00:50:04 DUE TO TIME LIMIT ***

=> Your job ran out of time. If your jobs usually fit in the wall time you asked for, this might be due to slow access to the filesystems. In this case you can wait until the filesystem problems have been solved or just resubmit and hope for the best. You can also ask for more walltime (BACKEND_time_mod) or run fewer days per job (Fcst_rstrt_S).
If you just started your simulation, you should either ask for more walltime (BACKEND_time_mod) and/or run fewer days per job (Fcst_rstrt_S).

Out of memory

slurmstepd: error: Detected 3 oom-kill event(s) in StepId=13861528.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

=> Your job ran out of memory. Ask for more MPI tiles (GEM_ptopo). You could also ask for more memory (BACKEND_cm) but this usually means that your jobs will be queued for much longer.


