fastest way of implementing loops [Archive]

fastest way of implementing loops

michielm

03-04-2010, 05:00 AM

Hi all,
I'm programming a routine to solve a large linear system at every time step of my program, thus I'm looking for ways to keep my code as fast as possible.

My question is which of the following DO loops will result in the fastest program:

DO i=1,N
X(i)=3*Y(i)+4
Z(i)=7*Y(i)-7
END DO

or

DO i=1,N
X(i)=3*Y(i)+4
END DO

DO i=1,N
Z(i)=7*Y(i)-7
END DO

The first looks the cleanest, but I remember a teacher of mine once told me something about the way fortran makes its calculations in loops and that the second option should be faster, but I'm not exactly sure anymore. Does someone know how this works exactly and which one is faster?!

davekw7x

03-05-2010, 01:27 PM

...I remember a teacher of mine once told me something about the way fortran...

Well, somewhere around 1965, maybe there was a way that "FORTRAN does things." FORTRAN was totally "managed" by IBM and implementation details were fairly well known in the industry. Implementations by other vendors (XEROX, CDC, UNIVAC, etc.) did things pretty much the same way (as far as I know). Nowadays, there is no such a thing as a way that "Fortran does things." Optimizing compilers from different vendors may do things very differently from each other and very differently from the way that compilers did things in the middle of the 20th century. Target-specific optimizations may have very different "rules of thumb" for efficient programming. See Footnotes.

I mean, it is possible that a particular compiler on a particular machine might be faster with two one-statement loops than with a single two-statement loop. Whether I doubt that or not is not an issue here. I probably would not have stood up in class and challenged my instructor to a coding duel, but I am pretty sure that I would have really tested it on my own. Really.

Even if someone shows me a dump of the machine code from a program, I will note that, regardless of the "quality" of the sequences of assembly language instructions from a particular compiler, differences in instruction cache and data cache performance for different CPU revisions of modern target processors with the same basic instruction sets really makes me question the validity of any claims of optimization without actually testing.

However...
If you are concerned about timing of a particular program construct, you can do tests with your particular compiler.

More importantly, I have seen people agonize over, and spend lots of time fine-tuning, certain parts of a complicated program when a careful analysis (or use of a profiling program) shows that they optimized the wrong thing. I mean, it's usually a Good Thing to be aware of "efficient" coding practices, but I think that code clarity and maintainability are far more important than rules of thumb that clutter up the code with more-or-less ineffective attempts at out-optimizing a modern compiler. But that's just me; I'm funny that way.

Bottom line: Regardless of the results of simple tests of isolated program constructs, overall program performance depends on everything going on in the program (and in other processes that happen to be active in a modern operating system).

Regards,

Dave

Footnotes:

1. Here's the way some might have recommended in the "olden" days:

C USE A TEMPORARY VARIABLE TO KEEP FROM
C HAVING TO CALCULATE THE ARRAY INDEX
C MORE THAN ONCE IN THE LOOP
DO 10 I=1,N
YI=Y(I)
X(I)=3*YI+4
Z(I)=7*YI-7
10 CONTINUE

Simple tests with GNU gfortran on one of my Linux systems do show a very slight improvement over your first example, and, contrary to what you think your instructor told you, your first example is slightly better than your second. (No command-line optimization switches, and enough nested loops to give a couple of seconds cpu time---20,000 iterations of an outer loop with an inner loop consisting assignments to elements of two double precision arrays with 10,000 elements each.) I didn't spend much time writing the code or analyzing the test conditions, and I don't care if others get different results and come to Important Different Conclusions. Really. I don't care. I'm not a "wizard," and it really doesn't matter.

2. Code from the real "wizards" of 1965 (running FORTRAN II on an IBM 7090) might might look more like the following (except that, in my experience, the real "wizards" didn't always bother with comments that explained their methodology):

C USE A TEMPORARY VARIABLE TO KEEP FROM
C HAVING TO CALCULATE THE ARRAY INDEX
C MORE THAN ONCE IN THE LOOP.
C
C USE ADDITION RATHER THAN MULTIPLICATION
C WHERE POSSIBLE SINCE MULTIPLICATION TAKES
C ABOUT SEVEN TIMES AS LONG AS ADDITION.
C
C ALSO NOTE THAT ASSIGNMENT STATEMENTS WITH
C MULTIPLE OPERATIONS ON THE RIGHT HAND SIDE
C MAY BE LESS EFFICIENT THAN A SEQUENCE OF
C ASSIGNMENTS WITH SIMPLE EXPRESSIONS.
C
DO 10 I=1,N
YI=Y(I)
Y3=YI+YI
Y3=Y3+YI
Y7=Y3+Y3
Y7=Y7+YI
X(I)=Y3+4
Z(I)=Y7-7
10 CONTINUE

Note that the little "rules of thumb" might be different for expressions with integers than for expressions with reals.

Note also, that if the formulas changed to, say X(I)= 17*X(I)-16 and Z(I)=14*Z(I)-14, parts the program would have had to be rewritten and re-hand-optimized (and re-tested and re-debugged). In library subprograms like the excellent 1960's IBM Scientific Subroutine Package, the very real performance advantage of such little "tricks" was obviously worth it to the authors and users of the day.

3. I don't have access to any kind of parallel or other multi-processing setup, and have no experience with any such thing. If someone tells me that their experience in a parallelized program is that code like the two-loop example of your post results in less elapsed time than the single-loop example (or is in any other way more "efficient"), it would be enlightening, but I would have no way of testing it.

"For a successful technology, reality must take precedence over public relations,
for Nature cannot be fooled."
---Richard Feynman

"Don't take other people's word for it. Do your own research."
---davekw7x

michielm

03-08-2010, 04:55 AM

hi dave
Thanks for the extensive answer.
You're absolutely right that the best way to check something like this is just to test it, but because I don't want to test every single program structure I write, I was just wondering if anyone knew some 'rules of thumb' by which to improve my programs speed.

As you clearly point out, this type of optimization is probably useless with modern compilers, so I will take your advice and code my program in an organized and well-readable manner instead (without messing up efficiency all together of course).

davekw7x

03-08-2010, 03:42 PM

...extensive answer...Sometimes I get a little wordy.
...I was just wondering...I think it's a Good Thing to poke around on the web and ask around. Sometimes people can give straightforward answers that can save lots of time. But (and I hate to repeat myself): don't take anyone's word for it. If it is important, test it for yourself with conditions like those that your program will encounter.
...your advice...Well, it is not exactly advice that I offer; it is an opinion. See Footnote.

I would like to see other opinions (especially if people can back them up with specific examples).

Regards,

Dave

Footnote:

"The opinions expressed here are not mine, but those of the dang voices in my head."
---Anonymous

DavidB

03-26-2010, 12:02 PM

Fortran stores arrays by column, rather than row. Some programs try to take advantage of this feature by manipulating arrays in terms of columns rather than rows.

However, this feature is irrelevant to the code block you posted.

Just looking at your two versions, the first one should be faster because you have eliminated N loops and the associated checks. Some coders try to optimize their code by eliminating as many checks for (i== N) as possible. They do some loop unrolling. Basically, every time the program goes through the loop, it has to check to see if the limit has been reached: does (i== N)? In which case, it knows it can exit the loop. This check takes time, so if it can be reduced, it should speed up program execution.

A Google search for "loop unrolling" or "loop unwinding" should give many results.

Here is some info on Wikipedia:

Loop Unwinding Page on Wikipedia (http://en.wikipedia.org/wiki/Loop_unwinding)