It’s never easy to switch machines when you’re using supercomputers for research. This week, I made the move from NERSC’s Cori machine back to Edison in preparation for a sustained Cori outage while phase II is integrated with phase I. What should have been a simple matter of changing the machine name from ‘corip1’ to ‘edison’ quickly became a day-long adventure in debugging.
With my first simulation on Edison, I received a pretty unhelpful segmentation fault error in the cesm log file. After turning on the DEBUG flag in env_build.xml, I got a more useful error message:
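For reference, the DEBUG flag can be toggled from the case directory with CESM’s xmlchange utility; this is a sketch assuming a CESM1-style case (the case name “mycase” is a placeholder, and the exact build script names vary by model version):

```shell
# Enable debug mode in env_build.xml (turns off optimization and
# enables runtime checks such as array bounds checking)
./xmlchange -file env_build.xml -id DEBUG -val TRUE

# A changed build flag requires a clean rebuild of the case
./mycase.clean_build
./mycase.build
```
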
“forrtl: severe (408): fort: (2): Subscript #1 of the array ATRCR has value 1 which is greater than the upper bound of 0”
With some googling, I came across this page, https://bb.cgd.ucar.edu/error-running-compset-fsdwsf-ert, which suggests a few changes to the source file ice_itd.F90. Another simulation, and the model ran! It was very slow, though, so I turned off debug mode and attempted another run. But then I was back where I started – a segmentation fault with no explanation. Why would the model run with debug mode turned on, but fail with it off?
One suggestion was that the optimization flags for the Intel compiler on Edison have some problems. Turning debug mode on changes the optimization level to -O0, which makes the simulation much slower but avoids problems that can occur at higher optimization settings. So I attempted another simulation, changing the default -O2 to -O1 (in config_compilers.xml), and finally had a simulation that ran all the way through with reasonable throughput (~13 simulated years/day).
The Edison section of my updated config_compilers.xml now looks like this:
<compiler COMPILER="intel" MACH="edison">
<ADD_FFLAGS DEBUG="FALSE"> -O1 </ADD_FFLAGS>
<ADD_CFLAGS DEBUG="FALSE"> -O1 </ADD_CFLAGS>
<CONFIG_ARGS> --host=Linux </CONFIG_ARGS>
<ADD_SLIBS> -L$(NETCDF_DIR) -lnetcdff -Wl,--as-needed,-L$(NETCDF_DIR)/lib -lnetcdff -lnetcdf $(MKL)</ADD_SLIBS>
</compiler>