How to find where/why a simulation crashed/stopped
...
Find the last listing
If you started your simulation with 'Chunk_lance' you have two ways to check how far your simulation has advanced. If you are using 'Um_lance' you can only use the second option.
1) Check file 'chunk_job.log' - only exists when submitting with 'Chunk_lance'
In your config file directory, have a look at the last lines of your file 'chunk_job.log'. I usually list them with:
tail chunk_job.log
Go into your listings directory (~/listings/${TRUE_HOST}). For example on Beluga/Narval:
cd ~/listings/Beluga
resp.
cd ~/listings/Narval
You can also do that by executing the following alias - if existing:
lis
List all the listings of the month that failed chronologically.
If the last line in chunk_job.log contains '... started ...', search for the last listing whose name starts(!) with the job name written in that line. For example with:
ls -lrt model_job_M*
If the last line in chunk_job.log contains '... finished ...', search for the last 'cjob_*' or 'pjob_*' listing. For example with:
ls -lrt ?job_experiment_name_*
Where 'experiment_name' is the base name of your simulation (without the YYYYMM at the end!).
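If you want to script this step, something like the following finds the newest listing for the job named in the last line of chunk_job.log. This is only a sketch: it assumes the job name is the last word of that line, so adjust the awk field to your chunk_job.log format.
job_name=$(tail -1 chunk_job.log | awk '{print $NF}')   # assumption: job name is the last word of the last line
ls -lrt ${job_name}* | tail -1                          # newest listing starting with that job name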
2) Check the listings
Go into your listings directory (~/listings/$TRUE_HOST). You can do that by executing the following alias - if existing:
lis
List all script and model listings of your simulation chronologically. For example with:
ls -lrt experiment_name_??????
ls -lrt ${GEM_exp}_[MS][_.]*
...
Where 'experiment_name' is the base name of your simulation (without the YYYYMM at the end!).
The last one will be the one in which the simulation failed.
Open the last listing in your editor or with 'less'
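If you prefer a one-liner, something like the following opens the newest matching listing directly with 'less'. It reuses the glob from the example above; adjust it if your listings are named differently.
less $(ls -rt ${GEM_exp}_[MS][_.]* | tail -1)   # open the most recent script/model listing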
If the model stopped in the ...
...
b) Model listing ${GEM_exp}_M*
Each model job consists of 3 main parts:
- It starts with a shell part,
- followed by the Fortran executable,
- followed by another shell part.
Below are a few different suggestions to find why the model crashed. I usually try them one after the other until I find the problem:
- Jump to the end of the listings (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G')
...
- and search upwards for '00000:'. This will bring you to the last line of the listing of the main process. From there, look upwards to see if you find anything out of the ordinary. The error might be several lines up!
- Search case-insensitively(!) for 'Traceback'.
- Search case-insensitively(!) for 'ABORT'. A grep-based alternative to these searches is sketched below.
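For this, a case-insensitive grep over the model listing shows all 'Traceback' and 'ABORT' lines at once. The file glob follows the naming from the heading of this section; adjust it to the listing you are actually inspecting.
grep -in -e 'traceback' -e 'abort' ${GEM_exp}_M*   # case-insensitive, with line numbers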
c) Chunk_lance listing cjob_* or pjob_*
If you submitted your simulation with 'Chunk_lance' you will also have listings starting with 'cjob_*' or 'pjob_*'. These listings include the calls to the scripts and the model described above.
- Jump to the end of the listings (when using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G')
- Search upwards until you find an error message
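If you only want a quick look at the end of the newest 'cjob_*'/'pjob_*' listing without opening an editor, something like this works (the glob is the same as in the 'Find the last listing' section; 'experiment_name' again is the base name of your simulation):
tail -n 50 $(ls -rt ?job_experiment_name_* | tail -1)   # last lines of the newest cjob_*/pjob_* listing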
Common error messages and how to interpret them
First shell part
Every model listing starts with this shell part.
Epoll ADD, ORTE, MPI launch failed
...
Common error messages in this part that point to a problem with the machine are:
...
...
Fortran executable
In the Fortran part, every MPI process (there is one MPI process per "tile") writes its own listing. Once the Fortran executable has finished running, MPI will collect the listings from all the processes and add them to the main model listing, ${GEM_exp}_M*. To be able to see which line was written by which process, all these lines are preceded by the number of the process, for example:
...
INFO: temporary listings for all members in directory_name
You can find the line above in your model listing!
In the directory 'directory_name' you have one directory per process, ?????, which contains the listing of said process.
...
If the model stopped in the Fortran executable, most of the time you can find an error message at the end of the listing of process 0. To get there, jump to the end of the listing and then search backwards for the end of the listing of the main process. (When using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G' and then search upward with '?00000:'.) But even from the end of the main process listing you might still have to look several lines up to find an error. However, once you reach a line saying:
THE TIME STEP n IS COMPLETED
there is probably no error above anymore. In that case you will have to look into the listings of all the other processes.
For a large grid you might have to use a lot of MPI processes and it is not easy to find an error message among them. Try looking for the word "Traceback". If you are lucky this takes you exactly to the traceback of the error.
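A way to narrow this down is to grep the temporary per-process listings for a traceback and then only open the files that are reported. 'directory_name' is the placeholder from the INFO line above; the exact layout of that directory is an assumption.
grep -ril 'traceback' directory_name   # list the per-process listings that contain a traceback
Then open the reported files with 'less'.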
If the executable started running but was not able to finish the first timestep, meaning if you do not have at least one line saying:
...
it is possible that there was a problem reading the restart files. Check in the listings of all processes if you find a message like:
oe-00000-...: forrtl: severe (24): end-of-file during read, unit 999, file .../gem_restart
If you find such a line you need to restart the simulation from the previous restart file.
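To check all processes at once you can grep for that message, either in the collected model listing or in the temporary per-process directory mentioned above (globs and paths are the ones used earlier on this page):
grep -n 'end-of-file during read' ${GEM_exp}_M*      # collected model listing
grep -rn 'end-of-file during read' directory_name    # or the temporary per-process listings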
If you find no error message at all, there might have been a problem with the machine and not all MPI processes could get started, or your restart files are corrupted in a way you cannot see. If this happens for the first time for a given month you can just restart the simulation. But if this happens more than once I would restart the simulation from the previous month's restart file, assuming there is a problem with the restart files.
...
If the model stops more than once at the same timestep have a look at the listings of ALL processes to see what went wrong.
If the model stops while writing the restart files you always have to restart the simulation from the previous month.
Meaning, if you have the following line in the listing:
oe-00000-00000: WRITING A RESTART FILE AT TIMESTEP # ...
But not the line saying:
oe-00000-00000: GEM_CTRL: END OF CURRENT TIME SLICE AT TIMESTEP ...
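A quick way to check this is to count both messages in the model listing (glob as above): if the first grep finds a match but the second one does not, the restart write did not complete and you have to go back to the previous month.
grep -c 'WRITING A RESTART FILE AT TIMESTEP' ${GEM_exp}_M*
grep -c 'END OF CURRENT TIME SLICE AT TIMESTEP' ${GEM_exp}_M*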
When the Fortran executable finishes fine, you will see the following messages at the end of the main process listing:
oe-00000-00000: Memory==> ...
:
oe-00000-00000: __________________TIMINGS ON PE #0_________________________
:
oe-00000-00000: .........RESTART
And then a big '****' box with an "END EXECUTION" inside.
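To verify quickly that the executable reached this normal end you can, for example, grep for it (glob as above):
grep -n 'END EXECUTION' ${GEM_exp}_M*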
Common error messages of the main model and their meaning:
Traceback
If you find a 'Traceback' you will hopefully also get some lines similar to the following:
oe-00000-00000: Image              PC                Routine            Line     Source
oe-00000-00000: maingemdm          0000000002287C5B  Unknown            Unknown  Unknown
oe-00000-00000: maingemdm          0000000000EB8391  fpe_handler_       52       ifort_fpe_handler.F90
oe-00000-00000: maingemdm          0000000002272607  Unknown            Unknown  Unknown
oe-00000-00000: libpthread-2.30.s  000014B7BFE2C0F0  Unknown            Unknown  Unknown
oe-00000-00000: maingemdm          0000000000AE51E3  lightning_lpi_     80       lightning_lpi.F90
oe-00000-00000: maingemdm          00000000009169A4  calcdiag_mp_calcd  546      calcdiag.F90
oe-00000-00000: maingemdm          0000000000800DA8  phyexe_            153      phyexe.F90
:
oe-00000-00000: maingemdm          00000000004956EE  gem_ctrl_          42       gem_ctrl.F90
oe-00000-00000: maingemdm          000000000041649C  gemdm_             55       gemdm.F90
oe-00000-00000: maingemdm          0000000000416370  MAIN__             2        maingemdm.F90
Starting from the top, look for the first source file that is part of the CRCM-GEM source code. In the example above, the simulation crashed (most probably) in line '80' of the source code file 'lightning_lpi.F90'.
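To pull out just the traceback lines that point to Fortran source files you can, for example, grep for the '.F90' suffix; note that this may also match unrelated lines of the listing.
grep -n '\.F90' ${GEM_exp}_M* | head -20   # first traceback lines referencing Fortran sources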
Dimensions differ from previous specification
oe-00000-00000: size(pp, 1 )= 71280 high= 73062 low= 1
oe-00000-00000: ERROR: gmm_create, requested dimensions differ from previous specification (res
oe-00000-00000: tart/create)
oe-00000-00000: ERROR: gmm_create, variable name ="XTH "
=> Possible reason: MPI-tiles too small
Bad canopy iteration temperature
oe-00000-00071: 0BAD CANOPY ITERATION TEMPERATURE      4    51   373.24     6     1
oe-00000-00071:   5301.31   384.41   315.63  1100.35     0.00  4224.56   234.27    13.73     0.00
oe-00000-00071:    373.24   281.25   273.15
oe-00000-00071: 0******** END TSOLVC ************************************************************************ -2
Crash in aprep.f
Crash in line with a division by 'THPOR'
If your simulation is crashing in a line with a division by 'THPOR' make sure the number of SAND and CLAY levels you set to be read in your 'physics_input_table' corresponds to the actual number of levels in your geophysical fields.
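To see which SAND and CLAY levels are requested, you can for example grep your 'physics_input_table'; the exact spelling of the entries depends on your setup, so treat this only as a starting point.
grep -in -e 'sand' -e 'clay' physics_input_table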
Crash after restart
If the job was restarted from the restart file, check the following:
1) Check if the permanent bus is still the same as before. To do that you can compare the current listing (the one with the crash) with the previous one (which should be archived in ${CLIMAT_archdir}/Listings/listings_....zip) using 'xxdiff' (see the sketch after this list). On Narval you will have to run 'module add difftools' to get access to xxdiff. The permanent bus will change if the executable was changed in a way that fields got added to or removed from the permanent bus.
2) Did you add any fields to outcfg.out? There are certain output fields that cannot be added to outcfg.out once a simulation has started. Or, if these fields were present from the start, no other fields can be added. However, one can replace one output field with another. Below is a list of these special fields:
Group 1: 'clse', 'cec ', 'cecm', 'ced ', 'cedm', 'cep ', 'cepm', 'cem ', 'cemm', 'cer ', 'cerm', 'ces ', 'cesm', 'cqt ', 'cqc ', 'cqcm', 'cqd ', 'cqdm', 'cqp ', 'cqpm', 'cqm ', 'cqmm', 'cqr ', 'cqrm', 'cqs ', 'cqsm', 'cey ', 'ceym', 'cef ', 'cefm', 'cqy ', 'cqym', 'cqf ', 'cqfm'
Group 2: 'fdac', 'fdre'
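For point 1), a minimal sketch of the comparison could look like the following; the two file names are placeholders for the previous listing (unzipped from the archive mentioned above) and the current one.
module add difftools                      # on Narval, to get access to xxdiff
xxdiff previous_listing current_listing   # look for differences in the permanent bus section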
ERROR: key object has dimensions smaller than value assigned - itf_phy/PHYOUT
Same as point 2) above!
...
Second shell part
The second shell part starts with the lines:
...
If the Fortran executable finished well - see section above - you can ignore this "FAILED" message.
Chunk_job listing (cjob_* or pjob_*)
If you cannot find any error message in the model listing, check the listing ending in '*.s'. If you submitted the simulation with Chunk_lance, have a look at the listing 'cjob_*.s'. When all went "well", this listing will be empty. But sometimes you can find messages in these files like the following:
Node failure
slurmstepd: error: *** JOB 17891032 ON nc20539 CANCELLED AT 2023-06-14T04:50:44 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
=> Obviously a problem with a node. Just resubmit (continue) your simulation.
Time limit exceeded
slurmstepd: error: *** JOB 13690472 ON nc30342 CANCELLED AT 2023-02-12T00:50:04 DUE TO TIME LIMIT ***
=> Your job ran out of time. If your jobs usually fit in the wall time you asked for, this might be due to slow access to the filesystems. In this case you can wait until the filesystem problems have been solved or just resubmit and hope for the best. You can also ask for more walltime (BACKEND_time_mod) or run fewer days per job (Fcst_rstrt_S).
If you just started your simulation, you should ask for more walltime (BACKEND_time_mod) and/or run fewer days per job (Fcst_rstrt_S).
Out of memory
slurmstepd: error: Detected 3 oom-kill event(s) in StepId=13861528.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
=> Your job ran out of memory. Ask for more MPI tiles (GEM_ptopo). You could also ask for more memory (BACKEND_cm) but this usually means that your jobs will be queued for much longer.