

Restart a simulation that was submitted with 'Chunk_lance'

If the simulation stopped (the job was killed) or crashed, it can often be resubmitted by simply executing the command again:

    Chunk_lance

Restart a simulation from a previous restart file

Sometimes part of the restart files gets overwritten before a month is finished, the restart files are corrupted, or one simply wants to rerun part of a simulation. In that case one cannot just relaunch the month with Chunk_lance but has to get the original (uncorrupted) restart files back (from the month preceding the one to be rerun) and restart the simulation from there.

When the model job stops for whatever reason, Chunk_lance automatically checks whether the restart files are still the original ones. If they have already been modified, the following message appears in the "chunk_job listing" (!!!), not in the model listing:

   At least one of the restart files got already rewritten
   Therefore the model could not get restarted automatically
   You have to restart your simulation starting from the previous restart files
            ----- ABORT -----
 
If you see this message, one cannot simply restart the simulation with Chunk_lance; as said, one needs to restart it from the previous restart files (see below).
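If you are not sure which case applies, a small helper can look for this message in the chunk_job listing before you decide how to restart. This is a minimal bash sketch: the function name `restart_files_rewritten` is hypothetical, and you have to pass it the path of your chunk_job listing yourself, since its exact location is not described on this page.

```shell
# Hypothetical helper (bash): succeeds (exit status 0) if the given
# chunk_job listing contains the abort message that Chunk_lance prints
# when the restart files were already rewritten.
restart_files_rewritten() {
  local listing="$1"
  # -q: no output, only the exit status; -F: match the text literally
  grep -qF 'At least one of the restart files got already rewritten' "$listing"
}

# Usage (the listing path is a placeholder you must fill in):
# if restart_files_rewritten /path/to/chunk_job_listing; then
#   echo "Restart from the previous restart files (steps 1 to 4)"
# else
#   Chunk_lance
# fi
```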


To restart a simulation from previous restart files, one needs to:

1) Make sure the restart files of the previous month are under ~/MODEL_EXEC_RUN/${TRUE_HOST}

The actual restart files are called:

  • gem_restart  
  • gmm_restart  
  • Whiteboard.ckpt

There is one set for each processor tile, in directories named ???-???. Depending on the state of the simulation, they are in one of 3 different locations:

a) While the model is running you can find them under:

    ~/MODEL_EXEC_RUN/${TRUE_HOST}/${GEM_exp}/RUNMOD/work/cfg_0000/???-???
or, for YinYang grids:
    ~/MODEL_EXEC_RUN/${TRUE_HOST}/${GEM_exp}/RUNMOD/work/cfg_0000/YIN/???-???
    ~/MODEL_EXEC_RUN/${TRUE_HOST}/${GEM_exp}/RUNMOD/work/cfg_0000/YAN/???-???

b) Once the model simulation for a month has finished but the post-processing has not run yet, they are under:

    ~/MODEL_EXEC_RUN/${TRUE_HOST}/Restarts/${GEM_exp}/*step*/*/RUNMOD/work/cfg_0000/...

c) Once the post-processing has finished the "archiving of the restarts", they are in your archive under:

    ${CLIMAT_archdir}/Restarts
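Before relaunching, it can help to verify that every tile directory really contains a complete set. The bash function below is a minimal sketch with a hypothetical name: you pass it the directory that holds the ???-??? tile directories, and it only checks that the three files exist, not that they are uncorrupted. For YinYang grids, run it once on the YIN and once on the YAN directory.

```shell
# Hypothetical helper (bash): checks that every processor-tile directory
# (???-???) directly under the given directory holds the three restart files.
# Prints each missing file and returns 1 if anything is missing.
check_restart_tiles() {
  local dir="$1" tile f status=0 found=0
  for tile in "$dir"/???-???/; do
    [ -d "$tile" ] || continue        # the glob did not match anything
    found=1
    for f in gem_restart gmm_restart Whiteboard.ckpt; do
      if [ ! -e "${tile}${f}" ]; then
        echo "missing: ${tile}${f}"
        status=1
      fi
    done
  done
  if [ "$found" -eq 0 ]; then
    echo "no ???-??? tile directories under $dir"
    return 1
  fi
  return "$status"
}

# Usage, e.g. while the model is running (location a):
# check_restart_tiles ~/MODEL_EXEC_RUN/${TRUE_HOST}/${GEM_exp}/RUNMOD/work/cfg_0000
```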

Depending on the script version you are using, the archived restart files can have different names. Either


${GEM_exp}_step*.ca.gz

Gunzip and unarchive (cmcarc -x -f ...) the restart file(s) in ~/MODEL_EXEC_RUN/${TRUE_HOST}.
Depending on the grid size, there can be more than one file per month. You need to gunzip and unarchive all of them.
The cmcarc command creates a new directory, but the *.ca file(s) remain in the directory. You can remove them afterwards.


or

${GEM_exp}_step*.tarz

Untar the file like any other compressed tar file. For example with:
    tar xzvf ...
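Both unpacking variants can be wrapped in one bash function that picks the method from the file suffix. This is a minimal sketch under a few assumptions: the function name `unpack_restarts` is hypothetical, the only cmcarc call it uses is the `cmcarc -x -f` shown above, and it uses `gunzip -c` so the archived *.ca.gz copy stays untouched (unlike a plain gunzip). Pass the destination directory (normally ~/MODEL_EXEC_RUN/${TRUE_HOST}) followed by all restart file(s) of the month you need.

```shell
# Hypothetical helper (bash): unpacks archived restart files of one month
# into a destination directory, choosing the method from the file suffix.
# Pass ALL files of the month: large grids can have more than one.
unpack_restarts() {
  local dest="$1"; shift
  local f ca
  for f in "$@"; do
    case "$f" in
      *.ca.gz)
        ca="$(basename "${f%.gz}")"
        # Decompress into $dest, leaving the archived copy untouched.
        gunzip -c "$f" > "$dest/$ca" || return 1
        # Extract with cmcarc (this creates a new directory in $dest) ...
        ( cd "$dest" && cmcarc -x -f "$ca" ) || return 1
        # ... then remove the leftover .ca file.
        rm -f "$dest/$ca"
        ;;
      *.tarz)
        # Gzip-compressed tar file: -C extracts directly into $dest.
        tar -xzf "$f" -C "$dest" || return 1
        ;;
      *)
        echo "unknown restart archive type: $f" >&2
        return 1
        ;;
    esac
  done
}
```

Afterwards, the *.ca.gz or *.tarz files in ${CLIMAT_archdir}/Restarts are unchanged, so the same month can be unpacked again if needed.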

2) Make sure the "Script-job", ${GEM_exp}_S, is in the config directory

If the Script-job of the month you want to rerun, ${GEM_exp}_S, is no longer in your config directory, you can find it in your archive under:

       ${CLIMAT_archdir}/Listings/jobs_*.zip

3) Edit the file 'chunk_job.log'

As you hopefully know, each simulated month consists of two kinds of jobs:

  • a very short Script-job: ${GEM_exp}_S
  • followed by one or more Model-jobs: ${GEM_exp}_M

To keep track of where the simulation is, Chunk_lance uses a log file called 'chunk_job.log'. The chunk_job itself checks this file to determine which job to execute next, so this log file should usually not be touched. However, to rerun part of a simulation, one can edit the log file by hand. Just make sure there is never a blank line at the end of 'chunk_job.log', since the chunk_job only checks the very last line of the file!
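Because a stray blank line at the end is easy to miss in an editor, one option is to strip trailing blank lines after editing. The bash function below is a minimal sketch with a hypothetical name; it treats empty and whitespace-only lines as blank and rewrites the file in place, so keep a backup copy of 'chunk_job.log' before using it.

```shell
# Hypothetical helper (bash): removes blank (empty or whitespace-only) lines
# at the end of chunk_job.log, because the chunk_job only reads the last line.
strip_trailing_blank_lines() {
  local log="$1" tmp
  tmp=$(mktemp) || return 1
  # Remember every line and the number of the last non-blank one,
  # then print everything up to (and including) that line.
  awk '{ lines[NR] = $0; if ($0 !~ /^[ \t]*$/) last = NR }
       END { for (i = 1; i <= last; i++) print lines[i] }' "$log" > "$tmp" \
    || { rm -f "$tmp"; return 1; }
  # Copy back with cat (not mv) so the log keeps its permissions.
  cat "$tmp" > "$log"
  rm -f "$tmp"
}
```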

'chunk_job.log' contains entries like the following:

1 simulation starting at Thu Aug 19 15:09:15 EDT 2021
1 scripts LUCAS_NAM-44_orgVeg_ISBA_197901_S finished at Thu Aug 19 15:09:36 EDT 2021
1 model   LUCAS_NAM-44_orgVeg_ISBA_197901_M starting at Thu Aug 19 15:09:36 EDT 2021
1 model   LUCAS_NAM-44_orgVeg_ISBA_197901_M finished at Thu Aug 19 16:04:22 EDT 2021
2 scripts LUCAS_NAM-44_orgVeg_ISBA_197902_S starting at Thu Aug 19 16:04:22 EDT 2021
2 scripts LUCAS_NAM-44_orgVeg_ISBA_197902_S finished at Thu Aug 19 16:04:26 EDT 2021
2 model   LUCAS_NAM-44_orgVeg_ISBA_197902_M starting at Thu Aug 19 16:04:26 EDT 2021
2 model   LUCAS_NAM-44_orgVeg_ISBA_197902_M finished at Thu Aug 19 16:52:18 EDT 2021


Each completed month has at least 4 entries:

..._S starting ...
..._S finished ...
..._M starting ...
..._M finished ...

To restart a month, make sure the last line contains either:

    ... previous_month_M finished ...

or

    ... current_month_S starting ...
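The same edit can be scripted: truncate the log right after the previous month's "_M finished" line, then check that the last line has one of the two accepted forms. This is a minimal bash sketch; both function names are hypothetical, and a "job name" here means the name as it appears in 'chunk_job.log' without its _S/_M suffix (for example LUCAS_NAM-44_orgVeg_ISBA_197901 in the example above).

```shell
# Hypothetical helpers (bash) for editing chunk_job.log by script.

# Succeeds if the last line of the log allows restarting month CUR, i.e. it
# contains "PREV_M finished" or "CUR_S starting". A trailing blank line fails.
ready_to_restart() {
  local log="$1" prev="$2" cur="$3" last
  last=$(tail -n 1 "$log")
  case "$last" in
    *" ${prev}_M finished "*|*" ${cur}_S starting "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Truncates the log right after the last "PREV_M finished" line, so that the
# month following PREV is rerun. A backup is written to LOG.bak first.
rewind_chunk_log() {
  local log="$1" prev="$2" n
  n=$(awk -v pat=" ${prev}_M finished " \
        'index($0, pat) { n = NR } END { print n + 0 }' "$log")
  if [ "$n" -eq 0 ]; then
    echo "no '${prev}_M finished' line in $log" >&2
    return 1
  fi
  cp "$log" "$log.bak" || return 1
  head -n "$n" "$log.bak" > "$log"
}

# Usage, to rerun February 1979 from the example log:
# rewind_chunk_log chunk_job.log LUCAS_NAM-44_orgVeg_ISBA_197901
# ready_to_restart chunk_job.log LUCAS_NAM-44_orgVeg_ISBA_197901 \
#                  LUCAS_NAM-44_orgVeg_ISBA_197902 && echo "log is ready"
```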

4) Execute "Chunk_lance" again (without the '-start'!!!)


