-
Notifications
You must be signed in to change notification settings - Fork 307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reproducibility problem with cnmatrix first occurs in dev172 going from ccs_config_cesm0.0.88 to ...89 #2619
Comments
The original aux_clm failure: < hist_fincl1 = 'TRAFFICFLUX','SNOWLIQ:A','SNOWICE:A','FCO2'
---
> hist_fincl1 = 'TRAFFICFLUX', 'SNOWLIQ:A', 'SNOWICE:A', 'FCO2', 'SNO_EXISTENCE', 'SNO_ABS', 'SNO_T:M', 'SNO_GS:X'
33,34c33,34
< hist_nhtfrq = -24,-8
< hist_wrt_matrixcn_diag = .true.
---
> hist_nhtfrq = 0,-240
> hist_wrt_matrixcn_diag = .false. @ekluzek explains that this hist_wrt_matrixcn_diag output is important for SASU spin-ups. We changed the ERP test to a REP test to show that this is a threading reproducibility problem:
This is the failure that first appears in dev172 with ccs_config_cesm0.0.89. From #640 testing/troubleshooting on this issue:
|
The change has to do with how MPI is binding processors on Derecho. This must have some influence on how the code is running for threading. There were problems with the code before the binding was done, which might mean that it passed before, because threading wasn't working correctly. And now we are seeing bad behavior with threading, but it's probably because threading is actually working correctly with this update. Also links to earlier mentions of potential threading problems in the cnmatrix PR: |
Note there are some issues with threading for CN-Matrix.
The first shows potential issues in how the code was implemented, but not likely something easy to fix now. The second is something that it seems we should fix though. |
@ekluzek by looking elsewhere in the code and through repeated trial and error, I have managed to solve this to my pleasant surprise. |
<entry id="NTHRDS">
<type>integer</type>
<values>
<value compclass="ATM">1</value>
<value compclass="CPL">1</value>
<value compclass="LND">1</value>
.
.
.
<value compclass="ESP">1</value>
</values>
<desc>number of threads for each task in each component</desc>
</entry> It's treated as NTHRDS_LND by xmlchange and xmlquery, but the perl unit tester AND the perl build-namelist code would need to implement the above syntax in order to do this correctly. This likely would take a bit to figure this out. So dropping for now. We figure we will want to implement the first option so that the unit tester is testing this capability. But, it will take a little more finagling than we want to spend now. But, is something that's straightforward to do (at least for me). |
In PR #2545 I saw a test fail in comparison baseline that is because of this issue see this comment: The test is: ERP_P64x2_Lm13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn_ignore_warnings so we should expect this same thing to come up in baseline testing for future tags as well. |
@slevis-lmwg is this something that we can fix by the beta04 release (mid-Sept timeline)? |
@wwieder this isn't something we really need now, but I would like to check off the next three boxes in the checklist at the top by ctsm6.0.0. So I put it in for that milestone. This is also something that I probably have the best idea on what needs to be done, once I can get to it. So assigned to me makes sense. |
A cnmatrix reproducibility problem first occurs in dev172 and more specifically when going from ccs_config_cesm0.0.88 to ccs_config_cesm0.0.89. This is the only difference that I see when comparing 89 to 88:
git diff ccs_config_cesm0.0.88 ccs_config_cesm0.0.89
Originally posted by @slevis-lmwg in #640 (comment)
Definition of Done:
The text was updated successfully, but these errors were encountered: