...
If the Fortran executable finished successfully (see the section above), you can ignore this "FAILED" message.
Problem with memory, time or node
If you cannot find any error message in the model listing, check the listing whose name ends in *.s. If you submitted the simulation with Chunk_lance, look at the listing 'cjob_*.s'. If everything went well, this listing is empty. Sometimes, however, it contains messages like the following:
Volet |
---|
slurmstepd: error: *** JOB 17891032 ON nc20539 CANCELLED AT 2023-06-14T04:50:44 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS *** |
=> Obviously a problem with a node. Just resubmit (continue) your simulation.
Volet |
---|
slurmstepd: error: *** JOB 13690472 ON nc30342 CANCELLED AT 2023-02-12T00:50:04 DUE TO TIME LIMIT *** |
=> Your job ran out of time. If your jobs usually fit in the walltime you asked for, this might be due to slow access to the filesystems. In this case you can wait until the filesystem problems have been solved, or just resubmit and hope for the best. You can also ask for more walltime (BACKEND_time_mod) or run fewer days per job (Fcst_rstrt_S).
If you have only just started your simulations, you should ask for more walltime (BACKEND_time_mod) and/or run fewer days per job (Fcst_rstrt_S).
Volet |
---|
slurmstepd: error: Detected 3 oom-kill event(s) in StepId=13861528.batch. Some of your processes may have been killed by the cgroup out-of-memory handler. |
=> Your job ran out of memory. Ask for more MPI tiles (GEM_ptopo). You could also ask for more memory (BACKEND_cm), but this usually means that your jobs will be queued for much longer.
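The three checks above can be automated with a small script. The following is only a minimal sketch in Python: the function names `classify_listing` and `scan_listings` are hypothetical, the search patterns are taken from the three slurmstepd messages quoted above, and the advice strings simply paraphrase this page. It assumes that the 'cjob_*.s' listings are plain text files in a single directory.

```python
import re
from pathlib import Path

# Patterns copied from the slurmstepd messages quoted on this page.
# The advice strings paraphrase the page; no tool prints them.
PATTERNS = [
    ("node_failure", re.compile(r"DUE TO NODE FAILURE"),
     "resubmit (continue) the simulation"),
    ("time_limit", re.compile(r"DUE TO TIME LIMIT"),
     "ask for more walltime (BACKEND_time_mod) or run fewer days per job (Fcst_rstrt_S)"),
    ("out_of_memory", re.compile(r"oom-kill event"),
     "ask for more MPI tiles (GEM_ptopo) or more memory (BACKEND_cm)"),
]


def classify_listing(text):
    """Return a list of (kind, advice) for every known message found in text."""
    found = []
    for kind, pattern, advice in PATTERNS:
        if pattern.search(text):
            found.append((kind, advice))
    return found


def scan_listings(directory, name_glob="cjob_*.s"):
    """Classify every non-empty listing in directory whose name matches name_glob.

    The directory layout is an assumption; adapt it to where your listings live.
    """
    results = {}
    for path in sorted(Path(directory).glob(name_glob)):
        text = path.read_text(errors="replace")
        if text.strip():  # an empty listing means the job went well
            results[path.name] = classify_listing(text)
    return results
```

Calling `scan_listings(".")` in the directory that holds the listings returns a dictionary keyed by file name. A non-empty listing with an empty result list contains a message that none of the three patterns covers and should be read by hand.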
Common error messages and their meanings
1) Message:
Volet |
---|
oe-00000-00000: size(pp, 1 )= 71280 high= 73062 low= 1 |
oe-00000-00000: ERROR ERROR: gmm_create, requested dimensions differ from previous specification (res |
oe-00000-00000: tart/create) |
oe-00000-00000: ERROR: gmm_create, variable name ="XTH " |
=> Possible reason: MPI-tiles too small
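To see quickly which variables are affected, the model listing can be searched for this message. The snippet below is a minimal sketch in Python, not part of the model tools: the function name `gmm_create_variables` is hypothetical, and the regular expression is based only on the single message quoted above, so other versions of the message may need a different pattern.

```python
import re

# Pattern derived from the one gmm_create message quoted on this page.
GMM_NAME = re.compile(r'gmm_create, variable name\s*=\s*"([^"]*)"')


def gmm_create_variables(listing_text):
    """Return the sorted, unique variable names reported by gmm_create errors."""
    names = {match.strip() for match in GMM_NAME.findall(listing_text)}
    return sorted(names)
```

Applied to the message above, the function returns `['XTH']`. An empty list means the listing does not contain this message in the form assumed here.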
2) CLASS
Volet |
---|
oe-00000-00071: 0BAD CANOPY ITERATION TEMPERATURE 4 51 373.24 6 1 |
oe-00000-00071: 5301.31 384.41 315.63 1100.35 0.00 4224.56 234.27 13.73 0.00 |
oe-00000-00071: 373.24 281.25 273.15 |
oe-00000-00071: 0******** END TSOLVC ************************************************************************ -2 |