Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reproducibility problem with cnmatrix first occurs in dev172 going from ccs_config_cesm0.0.88 to ...89 #2619

Open
2 of 8 tasks
slevis-lmwg opened this issue Jun 24, 2024 · 10 comments
Assignees
Labels
bug something is working incorrectly science Enhancement to or bug impacting science

Comments

@slevis-lmwg
Copy link
Contributor

slevis-lmwg commented Jun 24, 2024

A cnmatrix reproducibility problem first occurs in dev172 and more specifically when going from ccs_config_cesm0.0.88 to ccs_config_cesm0.0.89. This is the only difference that I see when comparing 89 to 88:

git diff ccs_config_cesm0.0.88 ccs_config_cesm0.0.89

diff --git a/machines/derecho/config_machines.xml b/machines/derecho/config_machines.xml
index 633fe74..6f9d6fd 100644
--- a/machines/derecho/config_machines.xml
+++ b/machines/derecho/config_machines.xml
@@ -21,11 +21,11 @@
     <MPI_GPU_WRAPPER_SCRIPT>get_local_rank</MPI_GPU_WRAPPER_SCRIPT>
     <PROJECT_REQUIRED>TRUE</PROJECT_REQUIRED>
     <mpirun mpilib="default">
-      <executable>mpiexec</executable>
+      <executable>mpibind</executable>
       <arguments>
         <arg name="label"> --label</arg>
-        <arg name="buffer"> --line-buffer</arg>
-        <arg name="num_tasks" > -n {{ total_tasks }}</arg>
+       <arg name="buffer"> --line-buffer</arg>
+       <arg name="separator"> -- </arg>
       </arguments>
     </mpirun>
     <module_system type="module" allow_error="true">

Originally posted by @slevis-lmwg in #640 (comment)

Definition of Done:

  • Add warning to build-namelist to abort when matrix is on and OMP_NUM_THREADS > 1 and test it
  • Have our tests setup to ignore the warnings for our tests for this configuration and test it
  • Add a test to the build_namelist_unit_tester with OMP_NUM_THREADS = 2 and verify that it aborts as intended
  • Use NTHRDS_LND rather than OMP_NUM_THREADS
  • Fix CN Matrix code so that OMP_NUM_THREADS env variable is NOT used, but use values from the NUOPC cap
  • Make sure that fixes it -- and if not track down the problem
  • Fix the real problem
  • Remove the warning in the namelist
@ekluzek ekluzek added next this should get some attention in the next week or two. Normally each Thursday SE meeting. tag: bug - impacts science labels Jun 24, 2024
@slevis-lmwg
Copy link
Contributor Author

slevis-lmwg commented Jun 24, 2024

The original aux_clm failure:
ERP_P64x2_Lm13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn BASELINE ctsm5.2.005.cn-matrix_n09: DIFF
The same test without --clm-matrixcnOn passed and the two tests differ ONLY as follows:

<  hist_fincl1 = 'TRAFFICFLUX','SNOWLIQ:A','SNOWICE:A','FCO2'
---
>  hist_fincl1 = 'TRAFFICFLUX', 'SNOWLIQ:A', 'SNOWICE:A', 'FCO2', 'SNO_EXISTENCE', 'SNO_ABS', 'SNO_T:M', 'SNO_GS:X'
33,34c33,34
<  hist_nhtfrq = -24,-8
<  hist_wrt_matrixcn_diag = .true.
---
>  hist_nhtfrq = 0,-240
>  hist_wrt_matrixcn_diag = .false.

@ekluzek explains that this hist_wrt_matrixcn_diag output is important for SASU spin-ups.

We changed the ERP test to a REP test to show that this is a threading reproducibility problem:

FAIL REP_P64x2_Lm13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn COMPARE_base_rep2
PASS REP_P128x1_Lm13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn

This is the failure that first appears in dev172 with ccs_config_cesm0.0.89.

From #640 testing/troubleshooting on this issue:
git grep matrixcnOn | grep 'test name' | grep P64x2 on the testlist and try these tests as REP tests. Change the Lm13 test (that we know fails) with Ld13 to see if shorter also fails. They all passed.

PASS REP_P64x2_Ld10_D.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-ciso_decStart--clm-matrixcnOn
PASS REP_P64x2_Ld10_D.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-default--clm-matrixcnOn
PASS REP_P64x2_Ld3_D.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel.clm-default--clm-matrixcnOn
PASS REP_P64x2_Ld3_D.f10_f10_mg37.I2000Clm60BgcCrop.derecho_intel.clm-coldStart--clm-matrixcnOn
PASS REP_P64x2_Ld3_D.f10_f10_mg37.I2000Clm50BgcCru.derecho_intel.clm-flexCN_FUN--clm-matrixcnOn
PASS REP_P64x2_Ld3_D.f10_f10_mg37.I2000Clm50BgcCru.derecho_intel.clm-noFUN_flexCN--clm-matrixcnOn
PASS REP_P64x2_Ly3.f10_f10_mg37.IHistClm60BgcCrop.derecho_intel.clm-cropMonthOutput--clm-matrixcnOn
PASS REP_P64x2_Ld5_D.f10_f10_mg37.I1850Clm50Bgc.derecho_intel.clm-ciso--clm-matrixcnOn
PASS REP_P64x2_Ld13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn
PASS REP_P64x2_D.f10_f10_mg37.I1850Clm50BgcCrop.derecho_intel.clm-default--clm-matrixcnOn

@ekluzek
Copy link
Collaborator

ekluzek commented Jun 24, 2024

The change has to do with how MPI is binding processors on Derecho. This must have some influence on how the code is running for threading. There were problems with the code before the binding was done, which might mean that it passed before, because threading wasn't working correctly. And now we are seeing bad behavior with threading, but it's probably because threading is actually working correctly with this update.

#2296 #2289

Also links to earlier mentions of potential threading problems in the cnmatrix PR:
#640 (comment)
#640 (comment)

@slevis-lmwg slevis-lmwg changed the title Reproducibility problem with cnmatrix first occurs in dev172 going from ccs_config_cesm0.0.88 to ccs_config_cesm0.0.89 Reproducibility problem with cnmatrix first occurs in dev172 going from ccs_config_cesm0.0.88 to ...89 Jun 24, 2024
@ekluzek
Copy link
Collaborator

ekluzek commented Jun 24, 2024

Note there are some issues with threading for CN-Matrix.

  1. Threading code uses different calls than without it
  2. Matrix is using the OMP_NUM_THREADS env variable rather than what's passed down from the NUOPC cap

The first shows potential issues in how the code was implemented, but not likely something easy to fix now. The second is something that it seems we should fix though.

@ekluzek ekluzek removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Jul 8, 2024
@slevis-lmwg slevis-lmwg self-assigned this Jul 9, 2024
@slevis-lmwg
Copy link
Contributor Author

slevis-lmwg commented Jul 11, 2024

@ekluzek I committed our changes (commit 4ca2b6a), but they do not work. I troubleshooted a bit, and all I could get was: out=NTHRDS_LND = 0

Testing with:
./create_test REP_P64x2_Ld13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn

@slevis-lmwg
Copy link
Contributor Author

slevis-lmwg commented Jul 11, 2024

@ekluzek by looking elsewhere in the code and through repeated trial and error, I have managed to solve this to my pleasant surprise.

@slevis-lmwg
Copy link
Contributor Author

slevis-lmwg commented Jul 12, 2024

I'm working on the next TODO item and having no luck. I pushed my attempt to 8e618bb in case @ekluzek you have suggestions.

The error says

#   Failed test 'matrixcnOn_with_threading'
#   at ./build-namelist_test.pl line 1393.
#          got: '0'
#     expected: anything else

@ekluzek
Copy link
Collaborator

ekluzek commented Jul 15, 2024

  1. So the problem is that OMP_NUM_THREADS is a proper UNIX environment variable and NOT part of the env_run.xml file. So it has to be treated in a different way from the env_*.xml file variables.

  2. The other problem with using NTHRDS_LND is that it's implemented in a more complex way in the XML file with a full entry like this:

    <entry id="NTHRDS">
      <type>integer</type>
      <values>
        <value compclass="ATM">1</value>
        <value compclass="CPL">1</value>
        <value compclass="LND">1</value>
.
.
.
        <value compclass="ESP">1</value>
      </values>
      <desc>number of threads for each task in each component</desc>
    </entry>

It's treated as NTHRDS_LND by xmlchange and xmlquery, but the perl unit tester AND the perl build-namelist code would need to implement the above syntax in order to do this correctly. This likely would take a bit to figure this out. So dropping for now.

We figure we will want to implement the first option so that the unit tester is testing this capability. But, it will take a little more finagling than we want to spend now. But, is something that's straightforward to do (at least for me).

@ekluzek
Copy link
Collaborator

ekluzek commented Jul 26, 2024

In PR #2545 I saw a test fail in comparison baseline that is because of this issue see this comment:

#2545 (comment)

The test is:

ERP_P64x2_Lm13.f10_f10_mg37.IHistClm60Bgc.derecho_intel.clm-monthly--clm-matrixcnOn_ignore_warnings

so we should expect this same thing to come up in baseline testing for future tags as well.

@ekluzek ekluzek added the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Jul 26, 2024
@wwieder wwieder removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Aug 8, 2024
@wwieder
Copy link
Contributor

wwieder commented Aug 8, 2024

@slevis-lmwg is this something that we can fix by the beta04 release (mid-Sept timeline)?

@samsrabin samsrabin added bug something is working incorrectly science Enhancement to or bug impacting science and removed bug - impacts science labels Aug 8, 2024
@ekluzek ekluzek added this to the ctsm6.0.0 (code freeze) milestone Sep 9, 2024
@ekluzek
Copy link
Collaborator

ekluzek commented Sep 9, 2024

@wwieder this isn't something we really need now, but I would like to check off the next three boxes in the checklist at the top by ctsm6.0.0. So I put it in for that milestone. This is also something that I probably have the best idea on what needs to be done, once I can get to it. So assigned to me makes sense.

@slevis-lmwg slevis-lmwg removed their assignment Sep 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug something is working incorrectly science Enhancement to or bug impacting science
Projects
Status: Todo
Development

No branches or pull requests

4 participants