How to find where/why a simulation crashed/stopped
Find the last listing
If you started your simulation with 'Chunk_lance' you have two possibilities to check how far your simulation has advanced. If you are using 'Um_lance' you can only use the second option.
1) Check file 'chunk_job.log' - only exists when submitting with 'Chunk_lance'
In your config file directory, have a look at the last lines of your file 'chunk_job.log'. I usually list them with:
tail chunk_job.log
Go into your listings directory (~/listings/$TRUE_HOST). You can do that by executing the following alias - if it exists:
lis
If the last line in chunk_job.log says '... started ...', search for the last listing that starts(!) with the job name written in that line. For example with:
ls -lrt model_job_M*
If the last line in chunk_job.log says '... finished ...', search for the last 'cjob_*' or 'pjob_*' listing. For example with:
ls -lrt ?job_experiment_name_*
Where 'experiment_name' is the base name of your simulation (without the YYYYMM at the end!).
2) Check the listings
Go into your listings directory (~/listings/$TRUE_HOST). You can do that by executing the following alias - if it exists:
lis
List all script and model listings of your simulation chronologically. For example with:
ls -lrt experiment_name_??????_[MS][_.]*
Where 'experiment_name' is the base name of your simulation (without the YYYYMM at the end!).
The last one will be the one in which the simulation failed.
Open the last listing in your editor or with 'less'
If the model stopped in the ...
a) Scripts listing ${GEM_exp}_S*
- Jump to the end of the listings (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G')
- Search upwards until you find an error message
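If you prefer the command line, a 'grep' over the scripts listing can also point you to possible error messages. This is only a sketch: the keywords are just a starting point, and ${GEM_exp}_S* stands for the scripts listing you want to check (adjust the name to your own):
grep -inE 'error|abort' ${GEM_exp}_S* | tail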
b) Model listing ${GEM_exp}_M*
Each model job consists of 3 main parts:
- It starts with a shell code,
- followed by the Fortran executable,
- followed by another shell part.
Below are a few different suggestions to find why the model crashed. I usually try them one after the other until I find the problem:
- Jump to the end of the listing (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G') and search upwards for '00000:'. This will bring you to the last line of the listing of the main process. From there, look upwards to see if you find anything out of the ordinary. The error might be several lines up!
- Search case insensitive(!) for 'Traceback'.
- Search case insensitive(!) for 'ABORT'.
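These searches can also be done with 'grep' instead of the editor. A sketch only; adjust the listing name to the model listing you want to check, and note that the keyword list is not exhaustive:
grep -inE 'traceback|abort' ${GEM_exp}_M* | tail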
c) Chunk_lance listing cjob_* or pjob_*
If you submitted your simulation with 'Chunk_lance' you will also have listings starting with 'cjob_' or 'pjob_'. These listings include the calls to the scripts and the model described above.
- Jump to the end of the listings (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G')
- Search upwards until you find an error message
Common error messages and how to interpret them
First shell part
Every model listing starts with this shell part.
Epoll ADD, ORTE, MPI launch failed
Common error messages in this part that point to a problem with the machine are:
- [warn] Epoll ADD(...) on fd n failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
- ORTE has lost communication with a remote daemon.
- An ORTE daemon has unexpectedly failed after launch and before ...
- An internal error has occurred in ORTE:
- ERROR: MPI launch failed after n seconds, killing process m
If you find any of the above error messages in your model listing you can just restart the simulation.
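To quickly check whether one of these machine-related messages is present, you can grep the model listing for a few keywords taken from the messages above. A sketch only; adjust the listing name and the keywords as needed:
grep -inE 'Epoll ADD|lost communication|unexpectedly failed|MPI launch failed' ${GEM_exp}_M*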
If this part stops with other messages, and you do not understand what they mean, search for the message in your scripts under:
~/modeles/GEMDM/version/bin
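To find out which script prints a given message, a recursive grep in that directory usually works. Replace the quoted text with a distinctive part of the actual message and 'version' with the GEMDM version you are using:
grep -rn 'distinctive part of the message' ~/modeles/GEMDM/version/bin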
If all goes well, the first shell part ends with:
:
INFO: mpirun ... veeeery long line!!!
INFO: MPI launch after n second(s)
INFO: START of listing processing : time & date
============== start of parallel run ==============
Fortran executable
In the Fortran part, every MPI process (there is one MPI process per "tile") writes its own listing. Once the Fortran executable has finished running, MPI will collect the listings from all the processes and add them to the main model listing, ${GEM_exp}_M*. To be able to see which line was written by which process, all these lines are preceded by the number of the process, for example:
oe-00000-00000:
oe-00000-00001:
oe-00000-00002:
:
Sometimes the model crashes so badly that MPI is not able to gather the listings from all of the processes. If this happens, look in your model listing for the following line to find out where the listings of the individual processes are kept:
INFO: temporary listings for all members in directory_name
In the directory 'directory_name' you have one directory per process, ?????, which contains the listing of said process.
If the model stopped in the Fortran executable, most of the time you can find an error message at the end of the listing of process 0. To get there, jump to the end of the listing and then search backwards for the end of the listing of the main process (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G' and then search upward with '?00000:'). Even from the end of the main process listing you might still have to look several lines up to find an error. However, once you reach a line saying:
THE TIME STEP n IS COMPLETED
there is probably no error above anymore. In that case you will have to look into the listings of all the other processes.
For a large grid you might have to use a lot of MPI processes, and it is not easy to find an error message among them. Try looking for the word "Traceback". If you are lucky this takes you exactly to the traceback of the error.
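Instead of opening the listings one by one, you can let 'grep' find the traceback for you. A sketch, where 'directory_name' stands for the directory from the 'temporary listings' line above:
grep -in 'traceback' ${GEM_exp}_M*      # search the main model listing
grep -rin 'traceback' directory_name    # search the per-process listings, if they were not gathered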
If the executable started running but was not able to finish the first timestep, i.e. you do not have at least one line saying:
THE TIME STEP n IS COMPLETED
it is possible that there was a problem reading the restart files. Check the listings of all processes for a message like:
oe-00000-...: forrtl: severe (24): end-of-file during read, unit 999, file .../gem_restart
If you find such a line you need to restart the simulation from the previous restart file.
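To check all processes at once for this kind of restart read problem, you can grep for the message above. A sketch; adjust the names as needed ('directory_name' is the temporary listings directory mentioned earlier):
grep -n 'end-of-file during read' ${GEM_exp}_M*
grep -rn 'end-of-file during read' directory_name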
If you find no error message at all, there might have been a problem with the machine and not all MPI processes could get started, or your restart files are corrupted in a way you cannot see. If this happens for the first time for a given month you can just restart the simulation. But if it happens more than once I would restart the simulation from the previous restart file, assuming there is a problem with the restart files.
If the executable stopped somewhere in the middle and you cannot find an error message but the last line of the listing of the main process says:
OUT_DYN- WRITING ...
or
OUT_PHY- WRITING ...
chances are the model got stuck while writing the output. In that case it might be enough to restart the simulation.
However, it might be safer to remove any output files of the previous attempt. You can find them under:
~/MODEL_EXEC_RUN/beluga/${GEM_exp}/RUNMOD/output/cfg_0000/*step*/???-???/*
In case you are not running full months make sure to only remove output from the job that stopped and not from any previous job!!!
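A cautious way to do this is to first list the files and only remove them once you are sure they all belong to the failed job. The path is the one from above; double-check it before running 'rm':
ls -l ~/MODEL_EXEC_RUN/beluga/${GEM_exp}/RUNMOD/output/cfg_0000/*step*/???-???/*
rm ~/MODEL_EXEC_RUN/beluga/${GEM_exp}/RUNMOD/output/cfg_0000/*step*/???-???/*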
But you can of course always check the listings of ALL processes.
If the model stops more than once at the same timestep, have a look at the listings of ALL processes to see what went wrong.
If the model stops while writing the restart files you always have to restart the simulation from the previous month.
That is, if the listing contains the line saying:
oe-00000-00000: WRITING A RESTART FILE AT TIMESTEP # ...
but not the line saying:
oe-00000-00000: GEM_CTRL: END OF CURRENT TIME SLICE AT TIMESTEP ...
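You can check for these two lines with a single grep. A sketch (adjust the listing name to your own):
grep -nE 'WRITING A RESTART FILE AT TIMESTEP|END OF CURRENT TIME SLICE' ${GEM_exp}_M*
If only the 'WRITING A RESTART FILE' line shows up, restart the simulation from the previous month.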
When the Fortran executable finishes fine, you will see the following messages at the end of the main process listing:
oe-00000-00000: Memory==> ...
:
oe-00000-00000: __________________TIMINGS ON PE #0_________________________
:
oe-00000-00000: .........RESTART
And then a big '****' box with an "END EXECUTION" inside.
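Assuming the 'END EXECUTION' text appears literally as described above, a quick way to check whether the executable reached a clean end is:
grep -c 'END EXECUTION' ${GEM_exp}_M*    # a count of 0 means the executable did not end cleanly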
Common error messages of the main model and their meaning:
Traceback
If you find a 'Traceback' you will hopefully also get some lines similar to the following:
oe-00000-00000: Image PC Routine Line Source
oe-00000-00000: maingemdm 0000000002287C5B Unknown Unknown Unknown
oe-00000-00000: maingemdm 0000000000EB8391 fpe_handler_ 52 ifort_fpe_handler.F90
oe-00000-00000: maingemdm 0000000002272607 Unknown Unknown Unknown
oe-00000-00000: libpthread-2.30.s 000014B7BFE2C0F0 Unknown Unknown Unknown
oe-00000-00000: maingemdm 0000000000AE51E3 lightning_lpi_ 80 lightning_lpi.F90
oe-00000-00000: maingemdm 00000000009169A4 calcdiag_mp_calcd 546 calcdiag.F90
oe-00000-00000: maingemdm 0000000000800DA8 phyexe_ 153 phyexe.F90
:
oe-00000-00000: maingemdm 00000000004956EE gem_ctrl_ 42 gem_ctrl.F90
oe-00000-00000: maingemdm 000000000041649C gemdm_ 55 gemdm.F90
oe-00000-00000: maingemdm 0000000000416370 MAIN__ 2 maingemdm.F90
Starting from the top, look for the first source file that is part of the CRCM-GEM source code. In the example above, the simulation crashed (most probably) in line '80' of the source code file 'lightning_lpi.F90'.
Dimensions differ from previous specification
oe-00000-00000: ERROR: gmm_create, requested dimensions differ from previous specification (res
oe-00000-00000: tart/create)
oe-00000-00000: ERROR: gmm_create, variable name ="XTH "
=> Possible reason: MPI-tiles too small
Bad canopy iteration temperature
oe-00000-00071: 5301.31 384.41 315.63 1100.35 0.00 4224.56 234.27 13.73 0.00
oe-00000-00071: 373.24 281.25 273.15
oe-00000-00071: 0******** END TSOLVC ************************************************************************ -2
Crash in aprep.f
Crash in line with a division by 'THPOR'
If your simulation is crashing in a line with a division by 'THPOR' make sure the number of SAND and CLAY levels you set to be read in your 'physics_input_table' corresponds to the actual number of levels in your geophysical fields.
Crash after restart
If the job was restarted from the restart file, check the following:
1) Check if the permanent bus is still the same as before. To do that you can compare the current listing (the one with the crash) with the previous one (which should be archived in ${CLIMAT_archdir}/Listings/listings_....zip) using 'xxdiff'; see the sketch after this list. On Narval you will have to run 'module add difftools' to get access to xxdiff. The permanent bus will change if the executable was changed in a way that fields got added to or removed from the permanent bus.
2) Did you add any fields to outcfg.out? There are certain output fields that cannot be added to outcfg.out once a simulation has started. Or, if these fields were present from the start, no other fields can be added. However, one can replace one output field with another. Below is a list of these special fields:
Group 1: 'clse', 'cec ', 'cecm', 'ced ', 'cedm', 'cep ', 'cepm', 'cem ', 'cemm', 'cer ', 'cerm', 'ces ', 'cesm', 'cqt ', 'cqc ', 'cqcm', 'cqd ', 'cqdm', 'cqp ', 'cqpm', 'cqm ', 'cqmm', 'cqr ', 'cqrm', 'cqs ', 'cqsm', 'cey ', 'ceym', 'cef ', 'cefm', 'cqy ', 'cqym', 'cqf ', 'cqfm'
Group 2: 'fdac', 'fdre'
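For point 1) above, the comparison could look roughly like the following sketch, where 'previous_listing' is the model listing extracted from the archived zip file and 'current_listing' is the listing of the crashed job (both names are placeholders):
module add difftools                      # needed on Narval to get access to xxdiff
xxdiff previous_listing current_listing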
ERROR: key object has dimensions smaller than value assigned - itf_phy/PHYOUT
Same as point 2) above!
Second shell part
The second shell part starts with the lines:
============== end of parallel run ==============
INFO: END of listing processing : date & time
However, sometimes there can be an error message like:
INFO: END of listing processing : date & time
INFO: RUN FAILED
INFO: first 10 failing processes :
fail.00000-... fail.00000-...
If the Fortran executable finished well - see section above - you can ignore this "FAILED" message.
Chunk_job listing (cjob_* or pjob_*)
If you cannot find any error message in the model listing, check the listing ending in '*.s'. If you submitted the simulation with Chunk_lance, have a look at the listing 'cjob_*.s'. When all went "well", this listing will be empty. But sometimes you can find messages like the following in these files:
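For example, to find the most recent of these listings and scan them for the messages described below (the pattern and the keywords are only an illustration; adjust them to your experiment name):
ls -lrt ?job_experiment_name_*.s                                   # an empty file means all went "well"
grep -ilE 'node fail|time limit|memory' ?job_experiment_name_*.s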
Node failure
=> Obviously a problem with a node. Just resubmit (continue) your simulation.
Time limit exceeded
=> Your job ran out of time. If your jobs usually fit in the wall time you asked for, this might be due to slow access to the filesystems. In this case you can wait until the filesystem problems have been solved, or just resubmit and hope for the best. You can also ask for more walltime (BACKEND_time_mod) or run fewer days per job (Fcst_rstrt_S).
If you just started your simulation, you should ask for more walltime (BACKEND_time_mod) and/or run fewer days per job (Fcst_rstrt_S).
Out of memory
=> Your job ran out of memory. Ask for more MPI tiles (GEM_ptopo). You could also ask for more memory (BACKEND_cm), but this usually means that your jobs will be queued for much longer.
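To see what you are currently asking for, you can grep your configuration file for these settings. The file name 'configexp.cfg' is only an assumption; use whatever file you configure your simulation with:
grep -E 'GEM_ptopo|BACKEND_cm|BACKEND_time_mod|Fcst_rstrt_S' configexp.cfg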