Bonus Task: Optimization of Nogtom Module
Changes
Brief Expression For Loops
For nested loops, we use a simpler form. For example:
We use do concurrent to rewrite the aforementioned loop block as a simple, standardized expression.
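As a minimal sketch of this kind of rewrite (not the module's actual code; the arrays a and b, their sizes, and the loop body are hypothetical), a nested loop and its do concurrent equivalent might look like this:

```fortran
program do_concurrent_sketch
  implicit none
  integer, parameter :: ni = 4, nj = 3   ! hypothetical extents
  real :: a(ni, nj), b(ni, nj)
  integer :: i, j

  ! Classic nested loops: fill a(i,j) element by element.
  do j = 1, nj
    do i = 1, ni
      a(i, j) = real(i) + 10.0 * real(j)
    end do
  end do

  ! Same computation as a single do concurrent construct.
  ! This is valid only because each iteration writes a distinct element
  ! and no iteration reads data written by another iteration.
  do concurrent (j = 1:nj, i = 1:ni)
    b(i, j) = real(i) + 10.0 * real(j)
  end do

  if (any(a /= b)) error stop "do concurrent result differs"
  print *, "nested loops and do concurrent agree"
end program do_concurrent_sketch
```

The construct only states that iterations are independent; whether the compiler vectorizes or parallelizes it depends on the compiler and its flags.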
For elementwise operations with regular subscripts, such as an assignment or calculation that uses the same position in the left-hand and right-hand arrays, a more concise form is available. For instance:
We use the modern Fortran method of array slicing assignment to rewrite the aforementioned loop block as a concise, standardized expression, as follows:
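A minimal sketch of the elementwise case, assuming hypothetical arrays a, b and c rather than variables from the module:

```fortran
program array_assign_sketch
  implicit none
  integer, parameter :: n = 5            ! hypothetical length
  real :: a(n), b(n), c_loop(n), c_array(n)
  integer :: i

  a = [(real(i), i = 1, n)]
  b = 2.0

  ! Explicit loop: the same index appears on both sides of the assignment.
  do i = 1, n
    c_loop(i) = a(i) * b(i) + 1.0
  end do

  ! Whole-array form of the same elementwise operation.
  c_array = a * b + 1.0

  if (any(c_loop /= c_array)) error stop "array assignment differs"
  print *, c_array
end program array_assign_sketch
```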
Similarly, for more complex cases of array slicing assignment, such as
We modify it to the following code:
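As a sketch of a slightly more complex case, the expression aamax = maxval(abs(qlhs(:,n))) quoted under Evaluation2 can be compared against an explicit loop. The array size, the test data, and the loop index jm are assumptions made for illustration:

```fortran
program maxval_section_sketch
  implicit none
  integer, parameter :: nqx = 5          ! assumed size (nqx = 5 for ISC24.in)
  real :: qlhs(nqx, nqx)
  real :: aamax_loop, aamax_slice
  integer :: jm, n

  ! Arbitrary test data with mixed signs (illustrative only).
  do n = 1, nqx
    do jm = 1, nqx
      qlhs(jm, n) = real(jm - 3) * real(n) - 0.5
    end do
  end do

  do n = 1, nqx
    ! Explicit loop: largest absolute value in column n.
    aamax_loop = 0.0
    do jm = 1, nqx
      if (abs(qlhs(jm, n)) > aamax_loop) aamax_loop = abs(qlhs(jm, n))
    end do

    ! Array-section form of the same reduction.
    aamax_slice = maxval(abs(qlhs(:, n)))

    if (aamax_loop /= aamax_slice) error stop "maxval over section differs"
  end do
  print *, "loop and array-section results agree"
end program maxval_section_sketch
```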
We have verified the aforementioned usage of array slicing assignment in Test/Evaluation2, confirming the functional correctness of this method.
Remove Argsort
From line 1727 to 1790 in mod_micro_nogtom_old.F90, which are the target blocks, we find that the argsort function is actually redundant by analysing the output of Test/Evaluation1, an experiment that verifies correctness without argsort.
By analysing the output of Test/Evaluation1, we can also show that the output of argsort, i.e. iorder, is unused in any meaningful way within the target code blocks. We can also remove it logically, step by step. For example:
For the aforementioned code, the operation on sinksum recalculates sinksum in the order of iorder. Notice that the assignment of the loop:
where lind2(jo,jn) and sinksum(jo) are independent of the order in which jo iterates. So we can directly remove jo and modify the block into the following code:
Since this way of calculating sinksum does not change its value, the calculation turns out to be redundant. The same applies to ratio. Therefore, the target blocks can be simplified to lind2 = qsexp < d_zero, and the calculations for sinksum and ratio are removed.
For the loop:
It can also be shown that this symmetric assignment to qsexp(jo,jn) and qsexp(jn,jo) is independent of the row traversal order jo = iorder(n). Therefore, jo can be directly simplified to n, and the code is modified to
Multiplication Optimization
Consider the following block:
For the aforementioned code, we note that in the accumulation of rainh there is an expression dtgdp * pfplsx(n,j,i,k+1) * dp. Additionally, we find that dtgdp equals dt * egrav / dp. Therefore, since multiplication is commutative, dtgdp * dp equals dt * egrav / dp * dp, which simplifies to dt * egrav (up to floating-point rounding).
Thus, the calculation of rainh can be simplified to rainh = rainh + wlhvocp * dt * egrav * pfplsx(n,j,i,k+1) and rainh = rainh + wlhsocp * dt * egrav * pfplsx(n,j,i,k+1).
Furthermore, we can remove the temporary variables rain and rainh by directly using the finally assigned variables sumq1 and sumh1. By applying the do concurrent method, we optimized it into the final version.
In Test/Evaluation3, we measured the performance, and this version is 60% faster than the original.
Results
For the given input file ISC24.in:
Time spent in mod_micro_nogtom.F90 drops from 384 s to 339 s, about a 12% performance improvement.
The share of total run time spent in mod_micro_nogtom.F90 decreases from 30.7% to 28.1%, as profiled by VTune.
Evaluation and Experiment
We ran several evaluations and experiments to confirm correctness and measure performance.
Evaluation1: In this optimization, we remove the function argsort and the code blocks that use its result, iorder. We show how it works in mod_micro_nogtom.F90, and the same holds for random inputs, which apparently shows that argsort can be deleted with no impact on correctness while saving 10% of the time needed to run this file.
Evaluation2: In this optimization, we verify the correctness of rewrites like
aamax = maxval(abs(qlhs(:,n)))
and
Through the output, we show that the results are correct.
Evaluation3: In this experiment, we speed up the block through logical analysis. We show how it works in mod_micro_nogtom.F90 and measure the speedup. It turns out to be about a 60% speedup on the example blocks, from the logical optimization together with removing local variables. We tried many ways to optimize, including dividing the independent outcomes to run separately. Finally, we applied the multiplication optimization and do concurrent to speed it up.
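The algebra behind the multiplication optimization can be checked with a small sketch. The numeric values below are illustrative only and are not taken from the model; flux is a hypothetical stand-in for pfplsx(n,j,i,k+1):

```fortran
program dtgdp_sketch
  implicit none
  integer, parameter :: wp = kind(1.0d0)
  ! Illustrative values only; not taken from the model.
  real(wp), parameter :: dt = 30.0_wp, egrav = 9.80665_wp
  real(wp), parameter :: wlhvocp = 2.5_wp
  real(wp) :: dp, dtgdp, flux, before, after

  dp = 1234.5_wp       ! assumed layer pressure thickness
  flux = 1.0e-4_wp     ! stands in for pfplsx(n,j,i,k+1)

  ! Original form: divide by dp, then multiply by dp again.
  dtgdp = dt * egrav / dp
  before = wlhvocp * dtgdp * flux * dp

  ! Simplified form: the dp factors cancel algebraically.
  after = wlhvocp * dt * egrav * flux

  ! The two forms agree up to rounding, not necessarily bit for bit.
  if (abs(before - after) > 1.0e-12_wp * abs(after)) error stop "forms differ"
  print *, before, after
end program dtgdp_sketch
```

Because the simplified form removes one division and one multiplication per accumulation, the last bits of the result may differ from the original; a tolerance-based comparison is therefore a safer check than exact equality.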
Experiment1: In this experiment, we test replacing the Bubble Sort algorithm in the function argsort with Quick Sort. It performs better in mod_micro_nogtom.F90 only when nqx > 100. However, for current inputs with nqx = 5, 7 or 10, argsort becomes much slower with Quick Sort because of its more complex computation. So the Bubble Sort in argsort can't be replaced by Quick Sort. We still include this test in the hope that Quick Sort can be an option for a future version in which argsort is reused and nqx grows to a larger scale.
Experiment2: In this experiment, we change the loop order from k, i, j, n to n, k, i, j and use do concurrent to speed it up. We move n to the outermost loop to decrease the time spent on the if/else statement. However, this brings no benefit for the block, even with do concurrent. We only replace it with do concurrent for better readability.
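The loop reordering in Experiment2 can be sketched as follows. This is an illustration under assumptions, not the module's code: the per-species flag is_special, the array extents, and the two branch bodies are hypothetical, and the branch is assumed to depend only on n.

```fortran
program loop_order_sketch
  implicit none
  integer, parameter :: nqx = 5, nk = 3, ni = 4, nj = 2   ! hypothetical extents
  real :: x(nqx, nj, ni, nk), y1(nqx, nj, ni, nk), y2(nqx, nj, ni, nk)
  logical :: is_special(nqx)      ! hypothetical per-species flag
  integer :: n, j, i, k

  call random_number(x)
  is_special = [(mod(n, 2) == 0, n = 1, nqx)]

  ! Original order k, i, j, n: the branch on n sits in the innermost loop.
  do k = 1, nk
    do i = 1, ni
      do j = 1, nj
        do n = 1, nqx
          if (is_special(n)) then
            y1(n, j, i, k) = 2.0 * x(n, j, i, k)
          else
            y1(n, j, i, k) = x(n, j, i, k) + 1.0
          end if
        end do
      end do
    end do
  end do

  ! Reordered n, k, i, j: the branch is decided once per n.
  do n = 1, nqx
    if (is_special(n)) then
      do concurrent (k = 1:nk, i = 1:ni, j = 1:nj)
        y2(n, j, i, k) = 2.0 * x(n, j, i, k)
      end do
    else
      do concurrent (k = 1:nk, i = 1:ni, j = 1:nj)
        y2(n, j, i, k) = x(n, j, i, k) + 1.0
      end do
    end if
  end do

  if (any(y1 /= y2)) error stop "loop orders disagree"
  print *, "both loop orders give the same result"
end program loop_order_sketch
```

Note that Fortran arrays are column-major, so when n is the first index, moving n to the outermost loop makes the inner loops access memory with a stride. That trade-off may offset the saved branches, though the experiment above does not attribute the result to any specific cause.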
Experiment3: In this experiment, we apply loop unrolling to the mysolve subroutine. We replace nqx with its actual value (nqx = 5 for the input file ISC24.in) so that the compiler can optimize the loops. However, it doesn't work well at small nqx scale.
To see what we changed in the target file, you can compare mod_micro_nogtom.F90 and mod_micro_nogtom_old.F90 side by side.