How to find where/why a simulation crashed/stopped
1) Check if the model stopped in the scripts or model job
Go into your listings directory
cd ~/listings/${TRUE_HOST}
For example, on Beluga:
cd ~/listings/Beluga
or on Narval:
cd ~/listings/Narval
List all the listings of the failed month in chronological order:
ls -lrt ${GEM_exp}_[MS][_.]*
2) Open the last listing in your editor or with 'less'
If the model stopped in the ...
a) Scripts listing ${GEM_exp}_S*
- Jump to the end of the listing (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G')
- Search upwards until you find an error message
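For example, to open the most recent scripts listing directly (just a sketch; run it from your listings directory):
less $(ls -rt ${GEM_exp}_S* | tail -1)
Inside 'less', press 'G' to jump to the end, then search upward with '?' followed by a word such as 'ERROR' or 'abort' (the exact wording of the message varies).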
b) Model listing ${GEM_exp}_M*
Jump to the end of the listing (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G')
Each model job consists of 3 main parts:
- It starts with a shell part,
- followed by the Fortran executable,
- followed by another shell part.
1) First shell part:
Common error messages in this part that point to a problem with the machine are:
- [warn] Epoll ADD(...) on fd n failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
- ORTE has lost communication with a remote daemon.
- An ORTE daemon has unexpectedly failed after launch and before ...
- An internal error has occurred in ORTE:
- ERROR: MPI launch failed after n seconds, killing process m
If you find any of the above error messages in your model listing you can just restart the simulation.
If this part stops with other messages, and you do not understand what they mean, search for the message in your scripts under:
~/modeles/GEMDM/version/bin
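For example, assuming the listing ends with a message you do not recognize, a recursive grep over the scripts directory will show which script prints it (replace the quoted text with a distinctive part of the actual message):
grep -rn "part of the error message" ~/modeles/GEMDM/version/bin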
If all goes well, the first shell part ends with:
:
INFO: mpirun ... veeeery long line!!!
INFO: MPI launch after n second(s)
INFO: START of listing processing : time & date
============== start of parallel run ==============
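To check quickly whether a given model listing reached this point, one can count the separator lines shown above (a sketch; run it in your listings directory):
grep -c "start of parallel run" ${GEM_exp}_M*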
2) Fortran executable
In the Fortran part, every MPI process (there is one MPI process per "tile") writes its own listing. Once the Fortran executable has finished running, MPI collects the listings from all the processes and adds them to the main model listing, ${GEM_exp}_M*. To show which line was written by which process, all these lines are preceded by the number of the process, for example:
oe-00000-00000:
oe-00000-00001:
oe-00000-00002:
:
Sometimes the model crashes so badly that MPI is not able to gather the listings from all of the processes. If this happens, you can find the per-process listings in the directory named in the following line:
INFO: temporary listings for all members in directory_name
You can find the line above in your model listing!
In the directory 'directory_name' there is one subdirectory per process, ?????, which contains the listing of that process.
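A quick way to locate that line in the model listing (a sketch, assuming the wording shown above):
grep "temporary listings for all members" ${GEM_exp}_M*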
If the model stopped in the Fortran executable, most of the time you can find an error message at the end of the listing of process 0. To get there, jump to the end of the listing and then search backwards for the last lines written by the main process. (When using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G' and then search upward with '?00000:'.) Even from the end of the main process listing you might still have to look several lines up to find an error. However, once you reach a line saying:
THE TIME STEP n IS COMPLETED
there is probably no error above anymore. In that case you will have to look into the listings of all the other processes.
For a large grid you might be using a lot of MPI processes, and it is not easy to find an error message among them. Try looking for the word "Traceback". If you are lucky, this takes you exactly to the traceback of the error.
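A sketch of how to search for it in one go, both in the main model listing and, if present, in the temporary per-process listings ('directory_name' stands for the directory reported in the INFO line above):
grep -n Traceback ${GEM_exp}_M*
grep -rl Traceback directory_name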
If the executable started running but was not able to finish the first timestep, that is, if you do not have at least one line saying:
THE TIME STEP n IS COMPLETED
it is possible that there was a problem reading the restart files. Check the listings of all processes for a message like:
oe-00000-...: forrtl: severe (24): end-of-file during read, unit 999, file .../gem_restart
If you find such a line you need to restart the simulation from the previous restart file.
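A quick way to check all processes at once for this kind of message (a sketch, assuming the wording shown above):
grep "end-of-file during read" ${GEM_exp}_M*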
If you find no error message at all, there might have been a problem with the machine (not all MPI processes could get started), or your restart files might be corrupted in a way you cannot see. If this happens for the first time for a given month you can just restart the simulation. But if it happens more than once, restart the simulation from the previous restart file, assuming there is a problem with the restart files.
If the executable stopped somewhere in the middle and you cannot find an error message but the last line of the listing of the main process says:
OUT_DYN- WRITING ...
or
OUT_PHY- WRITING ...
chances are the model got stuck while writing the output. In that case it might be enough to restart the simulation.
However, it might be safer to remove any output files of the previous attempt. You can find them under:
~/MODEL_EXEC_RUN/beluga/${GEM_exp}/RUNMOD/output/cfg_0000/*step*/???-???/*
In case you are not running full months make sure to only remove output from the job that stopped and not from any previous job!!!
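A cautious way to do this is to list the files first, check that they all belong to the failed attempt, and only then remove them (the path follows the example above; adjust the machine name and the *step* pattern to your setup):
ls -l ~/MODEL_EXEC_RUN/beluga/${GEM_exp}/RUNMOD/output/cfg_0000/*step*/???-???/*
rm ~/MODEL_EXEC_RUN/beluga/${GEM_exp}/RUNMOD/output/cfg_0000/*step*/???-???/*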
But you can, of course, always check the listings of ALL processes.
If the model stops more than once at the same timestep, have a look at the listings of ALL processes to see what went wrong.
If the model stops while writing the restart files you always have to restart the simulation from the previous month.
That is, if the listing contains the line:
oe-00000-00000: WRITING A RESTART FILE AT TIMESTEP # ...
But not the line saying:
oe-00000-00000: GEM_CTRL: END OF CURRENT TIME SLICE AT TIMESTEP ...
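A quick way to check for both lines at once (a sketch, based on the message text shown above):
grep "WRITING A RESTART FILE AT TIMESTEP" ${GEM_exp}_M*
grep "END OF CURRENT TIME SLICE AT TIMESTEP" ${GEM_exp}_M*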
When the Fortran executable finishes fine, you will see the following messages at the end of the main process listing:
oe-00000-00000: Memory==> ...
:
oe-00000-00000: __________________TIMINGS ON PE #0_________________________
:
oe-00000-00000: .........RESTART
And then a big '****' box with an "END EXECUTION" inside.
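To check quickly whether a listing contains this final box (a sketch; run it in your listings directory):
grep -c "END EXECUTION" ${GEM_exp}_M*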
3) Second shell part
The second shell part starts with the lines:
============== end of parallel run ==============
INFO: END of listing processing : date & time
However, sometimes there can be an error message like:
INFO: END of listing processing : date & time
INFO: RUN FAILED
INFO: first 10 failing processes :
fail.00000-... fail.00000-...
If the Fortran executable finished well - see section above - you can ignore this "FAILED" message.
Problem with memory, time or node
If you cannot find any error message in the model listing, check the listing ending in *.s. If you submitted the simulation with Chunk_lance, have a look at the listing 'cjob_*.s'. When all went "well", this listing will be empty. But sometimes you can find messages in these files like the following:
=> Obviously a problem with a node. Just resubmit (continue) your simulation.
=> Your job ran out of time. If your jobs usually fit in the wall time you asked for, this might be due to slow access to the filesystems. In this case you can wait until the filesystem problems have been solved or just resubmit and hope for the best. You can also ask for more walltime (BACKEND_time_mod) or run fewer days per job (Fcst_rstrt_S).
If you just started your simulations, you should either ask for more walltime (BACKEND_time_mod) and/or run fewer days per job (Fcst_rstrt_S).
=> Your job ran out of memory. Ask for more MPI tiles (GEM_ptopo). You could also ask for more memory (BACKEND_cm), but this usually means that your jobs will be queued for much longer.
Common error messages and their meanings
1) Message:
oe-00000-00000: ERROR: gmm_create, requested dimensions differ from previous specification (res
oe-00000-00000: tart/create)
oe-00000-00000: ERROR: gmm_create, variable name ="XTH "
=> Possible reason: MPI-tiles too small
2) CLASS
a) BAD CANOPY ITERATION TEMPERATURE
oe-00000-00071: 5301.31 384.41 315.63 1100.35 0.00 4224.56 234.27 13.73 0.00
oe-00000-00071: 373.24 281.25 273.15
oe-00000-00071: 0******** END TSOLVC ************************************************************************ -2
b) Crash in aprep.f
If the job was restarted from the restart file, make sure that the permanent bus is still the same as before. To do that, you can compare the current listing (the one with the crash) with the previous one (which should be archived in ${CLIMAT_archdir}/Listings/listings_....zip) using 'xxdiff'. On Narval you will have to run 'module add difftools' to get access to xxdiff.
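A sketch of the comparison on Narval (the file names are placeholders; use your current ${GEM_exp}_M* listing and the corresponding listing extracted from the archived zip file):
module add difftools
xxdiff previous_model_listing current_model_listing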