...
If the model stopped in the Fortran executable, most of the time you can find an error message at the end of the listing of process 0. To get there, jump to the end of the listing and then search backwards for the end of the listing of the main process. (In 'vi', 'vim' or 'less' you can jump to the end by pressing 'G' and then search upwards with '?00000:'.) Even from the end of the main model listing you might still have to look several lines up to find an error. However, once you reach a line saying:
THE TIME STEP n IS COMPLETED
there is probably no error above that point. In that case you will have to look into the listings of all the other processes.
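If you prefer to do this from the command line, the sketch below prints everything from the last completed time step to the end of the listing; the file name 'listing_00000' is only an assumption, use whatever your main-process listing is actually called.

```bash
# A minimal sketch, assuming the main-process listing is called "listing_00000"
# (hypothetical name): print everything from the last
# "THE TIME STEP ... IS COMPLETED" line to the end of the file, which is where
# an error message would show up.
tac listing_00000 | sed '/THE TIME STEP .* IS COMPLETED/q' | tac
```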
For a large grid you might have to use a lot of MPI processes, and it is not easy to find an error message among all their listings. Try searching for the word "Traceback". If you are lucky, this takes you straight to the traceback of the error.
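A hedged sketch of how such a search could look from the command line; the pattern 'listing_*' for the per-process listings is an assumption.

```bash
# A minimal sketch, assuming the per-process listings match "listing_*"
# (hypothetical pattern): list the files that contain a traceback, then show
# each traceback with 20 lines of context.
grep -l "Traceback" listing_*
grep -A 20 "Traceback" listing_*
```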
If the executable started running but was not able to finish the first time step, meaning you do not find even one line saying:
...
it is possible that there was a problem reading the restart files. Check the listings of all processes for a message like:
oe-00000-...: forrtl: severe (24): end-of-file during read, unit 999, file .../gem_restart
If you find such a line, you need to restart the simulation from the previous restart file.
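One way to check all listings at once could look like this (again assuming the per-process listings match 'listing_*'):

```bash
# A minimal sketch: search every per-process listing for the restart-read
# failure shown above; the glob "listing_*" is an assumption.
grep -H "end-of-file during read" listing_* || echo "no restart read error found"
```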
If you find no error message at all, there might have been a problem with the machine and not all MPI processes could get started, or your restart files are corrupted in a way you cannot see. If this happens for the first time for a given month, you can just restart the simulation. But if it happens more than once, I would restart the simulation from the previous month's restart file, assuming there is a problem with the restart files.
...
If the Fortran executable finished well - see section above - you can ignore this "FAILED" message.
Problems with memory, time or nodes
If you cannot find any error message in the model listing, check the listing ending in '*.s'. If you submitted the simulation with Chunk_lance, have a look at the listing 'cjob_*.s'. When all went well, this listing will be empty. But sometimes you can find messages like the following in these files (the sketch after these examples shows one way to scan them all at once):
slurmstepd: error: *** JOB 17891032 ON nc20539 CANCELLED AT 2023-06-14T04:50:44 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***
=> Obviously a problem with a node. Just resubmit (continue) your simulation.
slurmstepd: error: *** JOB 13690472 ON nc30342 CANCELLED AT 2023-02-12T00:50:04 DUE TO TIME LIMIT ***
=> Your job ran out of time. If your jobs usually fit in the wall time you asked for, this might be due to slow access to the filesystems. In that case you can wait until the filesystem problems have been solved, or just resubmit and hope for the best. You can also ask for more wall time (BACKEND_time_mod) or run fewer days per job (Fcst_rstrt_S).
If you just started your simulation, you should either ask for more wall time (BACKEND_time_mod) and/or run fewer days per job (Fcst_rstrt_S).
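As a hedged sketch, you could first check where these two parameters are currently set; the assumption here is that the experiment's configuration files sit in the current directory, and the exact file names differ between setups.

```bash
# A minimal sketch: find which configuration files set the wall time
# (BACKEND_time_mod) and the number of days per job (Fcst_rstrt_S)
# before changing them. File locations vary between experiments.
grep -rn -E "BACKEND_time_mod|Fcst_rstrt_S" .
```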
slurmstepd: error: Detected 3 oom-kill event(s) in StepId=13861528.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
=> Your job ran out of memory. Ask for more MPI tiles (GEM_ptopo). You could also ask for more memory (BACKEND_cm), but this usually means that your jobs will be queued for much longer.
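To scan all of these batch-system listings at once, a sketch along the following lines could help; the globs '*.s' and 'cjob_*.s' simply follow the file names mentioned above.

```bash
# A minimal sketch: look for the scheduler messages discussed above
# (node failure, time limit, out-of-memory) in the batch-system listings.
grep -H -E "slurmstepd|oom-kill|CANCELLED" *.s cjob_*.s 2>/dev/null
```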
Common error messages and their meanings
1) Message:
oe-00000-00000: size(pp, 1 )= 71280 high= 73062 low= 1
oe-00000-00000: ERROR: gmm_create, requested dimensions differ from previous specification (restart/create)
oe-00000-00000: ERROR: gmm_create, variable name ="XTH "
=> Possible reason: MPI-tiles too small
2) CLASS
a) BAD CANOPY ITERATION TEMPERATURE
oe-00000-00071: 0BAD CANOPY ITERATION TEMPERATURE 4 51 373.24 6 1
oe-00000-00071: 5301.31 384.41 315.63 1100.35 0.00 4224.56 234.27 13.73 0.00
oe-00000-00071: 373.24 281.25 273.15
oe-00000-00071: 0******** END TSOLVC ************************************************************************ -2
b) Crash in aprep.f
If the job was restarted from the restart file, make sure that the permanent bus is still the same as before. To do that, you can compare the listing of the crashed job with the previous one (which should be archived in ${CLIMAT_archdir}/Listings/listings_....zip) using 'xxdiff'. On Narval you will have to run 'module add difftools' to get access to xxdiff.
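A hedged sketch of that comparison; the archive name 'listings_PREVIOUS.zip' and the listing file names are placeholders for whatever your experiment actually produced.

```bash
# A minimal sketch: unpack the archived listings of the previous job and
# compare the main-process listing with the one from the crashed job.
# "listings_PREVIOUS.zip" and "listing_00000" are hypothetical names.
module add difftools                          # needed on Narval for xxdiff
unzip ${CLIMAT_archdir}/Listings/listings_PREVIOUS.zip -d previous_listings
xxdiff previous_listings/listing_00000 listing_00000
```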