...
For example, on Beluga:
cd ~/listings/Beluga
or on Narval:
cd ~/listings/Narval
List all listings of the month that failed, in chronological order:
ls -lrt ${GEM_exp}_[MS][_.]*
...
- It starts with a shell part,
- followed by the Fortran executable,
- followed by another shell part.
1) If all goes well, the first shell part ends with:
Volet |
---|
: |
Common error messages in this part that point to a problem with the machine are:
- [warn] Epoll ADD(...) on fd n failed. Old events were 0; read change was 0 (none); write change was 1 (add): Bad file descriptor
- ORTE has lost communication with a remote daemon.
- An ORTE daemon has unexpectedly failed after launch and before ...
- An internal error has occurred in ORTE:
- ERROR: MPI launch failed after n seconds, killing process m
If you find any of the above error messages in your model listing, there was most likely a problem with the machine, and you can just restart the simulation.
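The check for these messages can be scripted. The following is a minimal sketch, assuming the listing is an uncompressed text file; the function name `has_machine_error` is hypothetical, and the search strings are shortened forms of the messages listed above so that the variable parts (`n`, `m`, `(...)`) do not matter.

```shell
# Hypothetical helper: succeeds (exit status 0) if the given listing
# contains one of the machine-related messages listed above.
# Assumption: the listing is plain, uncompressed text.
has_machine_error() {
    grep -q -F \
        -e 'Epoll ADD(' \
        -e 'ORTE has lost communication with a remote daemon' \
        -e 'An ORTE daemon has unexpectedly failed after launch' \
        -e 'An internal error has occurred in ORTE' \
        -e 'ERROR: MPI launch failed after' \
        -- "$1"
}

# Example use on one listing file:
#   has_machine_error my_listing && echo "machine problem likely, restart"
```

`grep -F` treats the patterns as fixed strings, so the parenthesis in the Epoll message needs no escaping.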
...
~/modeles/GEMDM/version/bin
If all goes well, the first shell part ends with:
Volet |
---|
: |
2) Fortran executable
In the Fortran part, every MPI process (there is one MPI process per "tile") writes its own listing. Once the Fortran executable has finished running, MPI collects the listings from all processes and adds them to the main model listing, ${GEM_exp}_M*. To show which line was written by which process, each of these lines is preceded by the number of the process, for example:
...
If the model stopped in the Fortran executable, most of the time you can find an error message at the end of the listing of process 0. To get there, jump to the end of the listing and then search backwards for the end of the listing of the main process. (When using 'vi', 'vim' or 'less' you can jump to the end by pressing 'G' and then search upward with '?00000:'.) But even from the end of the main model listing you might still have to look several lines up to find an error. However, once you reach a line saying:
THE TIME STEP n IS COMPLETED
there is probably no error further up. In that case you will have to look into the listings of all the other processes.
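The same search can be done without opening an editor. This is a sketch under the assumption that every line of the main process contains the string `00000:`, as the '?00000:' search above suggests; the function names are hypothetical.

```shell
# Print the last N lines (default 40) written by the main process.
# $1: main model listing, $2: optional number of lines.
# Assumption: lines of the main process contain the string "00000:".
main_process_tail() {
    grep -F '00000:' -- "$1" | tail -n "${2:-40}"
}

# Print the last "THE TIME STEP n IS COMPLETED" line of the main process.
# Prints nothing if no time step was completed.
last_completed_step() {
    grep -F '00000:' -- "$1" \
        | grep 'THE TIME STEP .* IS COMPLETED' \
        | tail -n 1
}
```

If `main_process_tail` shows no error below the line printed by `last_completed_step`, the listings of the other processes are the next place to look.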
...
If the executable stopped somewhere in the middle and you cannot find an error message but the last line of the listing of the main process says:
OUT_DYN- WRITING ...
or
OUT_PHY- WRITING ...
chances are the model got stuck while writing the output. In that case it might be enough to restart the simulation.
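A quick way to test for this situation is sketched below. It assumes, as above, that lines of the main process contain `00000:`; the function name `stuck_while_writing` is hypothetical.

```shell
# Succeeds (exit status 0) if the last line written by the main process
# is an "OUT_DYN- WRITING" or "OUT_PHY- WRITING" message.
# Assumption: lines of the main process contain the string "00000:".
stuck_while_writing() {
    grep -F '00000:' -- "$1" \
        | tail -n 1 \
        | grep -q -E 'OUT_(DYN|PHY)- WRITING'
}
```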
However, it might be safer to remove any output files of the previous attempt first. You can find them under:
~/MODEL_EXEC_RUN/beluga/${GEM_exp}/RUNMOD/output/cfg_0000/*step*/???-???/*
If you are not running full months, make sure to remove only the output of the job that stopped and not the output of any previous job!
But you can of course always check the listings of ALL processes.
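Before removing output files of a previous attempt, it can help to preview exactly which files the path pattern given above matches. The sketch below only prints the matches and deletes nothing; the function name `list_previous_output` is hypothetical, and its argument is assumed to be the `output` directory of the experiment.

```shell
# Print (without deleting) the files matched by the pattern
#   <output dir>/cfg_0000/*step*/???-???/*
# $1: output directory, e.g.
#     ~/MODEL_EXEC_RUN/beluga/${GEM_exp}/RUNMOD/output
list_previous_output() {
    for f in "$1"/cfg_0000/*step*/???-???/*; do
        # Skip the literal pattern if nothing matched
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Review the printed list, and keep only the entries belonging to the job that stopped, before removing anything by hand.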
...
If the model stops more than once at the same timestep, have a look at the listings of ALL processes to see what went wrong.
When the Fortran executable finishes successfully, you will see the following messages at the end of the main process listing:
...
============== end of parallel run ==============
INFO: END of listing processing : date & time
INFO: RUN FAILED
INFO: first 10 failing processes :
fail.00000-...
fail.00000-...
If the Fortran executable finished successfully (see the section above), you can ignore this "FAILED" message.
...