Martin Paech
06-14-2014, 06:48 AM
Dear all,
This is my first post in this forum, so please be patient with a "greenhorn" (and my English).
We are currently analyzing the performance of an application that uses SVD::decompose() heavily. Based on profiler runs, we noticed that the matrix access isn't cache-friendly; the outer loops walk across the columns while the inner loops run down the rows:
[174] for (i=NRMIN(m,n)-1;i>=0;i--) {
[175] l=i+1;
[176] g=w[i];
[177] for (j=l;j<n;j++) u[i][j]=0.0;
[178] if (g != 0.0) {
[179] g=1.0/g;
[180] for (j=l;j<n;j++) {
[181] * for (s=0.0,k=l;k<m;k++) s += u[k][i]*u[k][j];
[182] f=(s/u[i][i])*g;
[183] * for (k=i;k<m;k++) u[k][j] += f*u[k][i];
[184] }
[185] for (j=i;j<m;j++) u[j][i] *= g;
[186] } else for (j=i;j<m;j++) u[j][i]=0.0;
[187] ++u[i][i];
[188] }
This means that we don't exploit the locality of adjacent elements within the same row; in our measurements this results in approximately 34% L2 data cache misses on an Intel Itanium 9560 CPU. If we could change the access pattern so that the inner loop varies the column index (and thus walks contiguous memory), the cache behaviour should improve. However, this change is probably not trivial, and we unfortunately are not in a position to do it ourselves.
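To make the stride problem concrete, here is an illustration. We assume u is stored row-major with the rows allocated back to back, so that &u[k+1][0] == &u[k][0] + n; if the rows are separate allocations the column stride is even less predictable:

#include <cstddef>

// Illustration only: distance, in doubles, between two consecutive
// accesses of the inner k loops (rows assumed allocated back to back).
std::ptrdiff_t step_down_column(double **u, int k, int i) {
    return &u[k+1][i] - &u[k][i];   // == n: a whole row per iteration
}
std::ptrdiff_t step_along_row(double **u, int i, int k) {
    return &u[i][k+1] - &u[i][k];   // == 1: unit stride, cache friendly
}

Lines 181 and 183 take the first form (u[k][i] and u[k][j] with k varying), so for any reasonably large n every inner-loop iteration touches a different cache line.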
As an experiment we made a quick (and admittedly incorrect) change to the loop order in two "hot" loops of SVD::decompose(), lines 181 and 183 in svd.h (marked with * above). With this change the dcache profile improves considerably: only around 9.7% L2 misses remain. Needless to say, we cannot use this change since it computes the wrong result, but we believe it should be possible to store the u[][] matrix transposed and rewrite decompose() with the row and column roles exchanged, so that the calculation stays correct.
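Roughly, we imagine the two hot loops looking like this after the transposition; ut is our hypothetical n x m matrix holding u transposed (ut[c][r] == u[r][c]), and this is only a sketch of the idea, not tested code:

// Sketch: lines [180]-[183] rewritten against the transposed ut.
// Both k loops now run along a row of ut, i.e. contiguous memory.
for (j=l;j<n;j++) {
    for (s=0.0,k=l;k<m;k++) s += ut[i][k]*ut[j][k];  // was u[k][i]*u[k][j]
    f=(s/ut[i][i])*g;
    for (k=i;k<m;k++) ut[j][k] += f*ut[i][k];        // was u[k][j] += f*u[k][i]
}

The loops on lines 185/186 would become unit stride as well (ut[i][j] instead of u[j][i]), while the row clear on line 177 would become strided; the two marked loops dominate the work, though.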
Do you think this experiment would be worth trying to improve the performance? Besides the impact on cache hits, it might also help compilers auto-vectorize the inner loops for AVX and similar instruction-set extensions.
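On the vectorization point: after the transposition, the reduction on line 181 has the canonical unit-stride shape below. As far as we understand, g++ and clang++ only vectorize such a floating-point sum when they are allowed to reassociate it (e.g. -O3 -ffast-math, or an OpenMP simd reduction), whereas the original stride-n walk is generally left scalar. Again, this is just a sketch of the loop shape, not code from svd.h:

// Sketch of the post-transposition inner loop: a plain unit-stride
// dot product. With reassociation enabled the compiler can turn
// this into packed multiply/adds (e.g. AVX vmulpd/vaddpd).
double dot(const double* __restrict a, const double* __restrict b, int len)
{
    double s = 0.0;
    for (int k = 0; k < len; k++) s += a[k]*b[k];
    return s;
}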
We would be very grateful for any thoughts, comments, and suggestions.
Martin