Can Anyone Explain This Bizarre Behavior?

Negative_3
n00b · Joined Apr 14, 2008 · Messages: 45
Here is a little C program that I am compiling with Visual Studio 2005:

[screenshot of the source code: SourceCode3.jpg]


This program just tells you how long it takes to run some random code. Notice that there are two things commented out (the green text). One would expect that uncommenting these things would slow the program down slightly, as it would have more to do.

First run the program and see how long it takes. Then uncomment the "+ 3.1" and run it again. I expected it to be slightly slower, but instead it is MANY TIMES FASTER! Can anyone explain this?

Next, uncomment the inline ASM block and run the program again. I expected it to be slightly slower, but instead it is SLIGHTLY FASTER! A random ASM statement that does NOTHING actually speeds the program up? Can anyone explain this?

Here is the code in non picture format:

Code:
#include <stdio.h>
#include <windows.h>

int main(void)
{
	int i,j;
	unsigned int startTime;
	float a,b,c,x,y,z,vx=1,ux=-1,wx=-3,vy=3,uy=4,wy=5,vz=6,uz=7,wz=8,denominator;
	__declspec(align(16))float f1[4]={0,0,0,0};

	printf("Starting...\n");

	startTime=GetTickCount();
	for(j=0;j<10000;j++)
	{
		for(i=0;i<10000;i++)
		{
			x=(float) i;
			y=x+1;
			z=y+1;
			
			//__asm
			//{
			//	movaps xmm0,f1
			//}

			denominator= -uz*vy*wx + uy*vz*wx + uz*vx*wy - ux*vz*wy - uy*vx*wz + ux*vy*wz /*+ 3.1*/;
			a= (uz*wy*x - uy*wz*x - uz*wx*y + ux*wz*y + uy*wx*z - ux*wy*z)/denominator;
			b= (vz*wy*x - vy*wz*x - vz*wx*y + vx*wz*y + vy*wx*z - vx*wy*z)/denominator;
			c= (uz*vy*x - uy*vz*x - uz*vx*y + ux*vz*y + uy*vx*z - ux*vy*z)/denominator;
		}
	}
	printf("Number Of Milliseconds: %u\n\n",GetTickCount()-startTime,a,b,c);
}
 
BTW - That last printf() statement has three unused items. I had to do something with a, b, and c or else the compiler would never even bother to compute them. I appreciate the time MS employees spent to weed out certain things to make my software faster, but sometimes I want the program to do EXACTLY what I tell it!! Especially for benchmarking!!!
 
BTW - That last printf() statement has three unused items. I had to do something with a, b, and c or else the compiler would never even bother to compute them. [...]
Not sure if feeding printf() garbage was the right approach. :p
 
If you don't want the variable to exist only in registers, declare it "volatile". Remove the printf garbage.
 
BTW - That last printf() statement has three unused items. I had to do something with a, b, and c or else the compiler would never even bother to compute them. [...]
Then turn off compiler optimizations.
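For reference, with the Visual C++ 2005 command-line compiler that looks roughly like this (assuming the file is saved as bench.c; /Od and /O2 are the standard cl.exe optimization switches):

```shell
# Debug-style build: optimizations off, generated code follows the source closely
cl /Od bench.c

# Release-style build: optimize for speed (this is what produced the behavior above)
cl /O2 bench.c
```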
 
So, compile both versions (change a, b, and c to volatile float, remove them from the printf), look at the disassembly, and compare...

First things first, you'll notice right away that the calculations for a, b, and c are almost identical, and can be ignored for the performance comparison:
Code:
			a= (uz*wy*x - uy*wz*x - uz*wx*y + ux*wz*y + uy*wx*z - ux*wy*z)/denominator;
00C31068  fld         qword ptr [__real@4041800000000000 (0C321A0h)]  
00C3106E  fmul        st,st(1)  
00C31070  fld         st(1)  
00C31072  fmul        qword ptr [__real@4040000000000000 (0C32198h)]  
00C31078  fsubp       st(1),st  
00C3107A  fld         st(3)  
00C3107C  fmul        qword ptr [__real@c035000000000000 (0C32190h)]  
00C31082  fsubp       st(1),st  
00C31084  fld         st(3)  
00C31086  fmul        qword ptr [__real@4020000000000000 (0C32188h)]  
00C3108C  fsubp       st(1),st  
00C3108E  fld         dword ptr [esp+0Ch]  
00C31092  fld         qword ptr [__real@4028000000000000 (0C32180h)]  
00C31098  fmul        st,st(1)  
00C3109A  fsubp       st(2),st  
00C3109C  fld         qword ptr [__real@c014000000000000 (0C32178h)]  
00C310A2  fmul        st,st(1)  
00C310A4  fsubp       st(2),st  
00C310A6  fxch        st(1)  
00C310A8  fdiv        st,st(3)  
00C310AA  fstp        dword ptr [esp+0Ch]

Also, the looping part of the i and j loops is the same:
Code:
00C31128  mov         dword ptr [esp+0Ch],eax  
00C3112C  cmp         eax,2710h  
00C31131  jl          main+41h (0C31041h)  
	for(j=0;j<10000;j++)
00C31137  dec         ecx  
00C31138  jne         main+35h (0C31035h)  
00C3113E  fstp        st(0)  
		}

Also, the setting of y and z is the same:
Code:
			y=x+1;
000A104E  fld         dword ptr [esp+0Ch]  
000A1052  fld         st(0)  
000A1054  fadd        st,st(2)  
000A1056  fstp        dword ptr [esp+0Ch]  
			z=y+1;
000A105A  fld         dword ptr [esp+0Ch]  
000A105E  fld         st(0)  
000A1060  faddp       st(3),st  
000A1062  fxch        st(2)  
000A1064  fstp        dword ptr [esp+0Ch]

Simplifying it down, you have the following two blocks of code.
Fast:
Code:
		for(i=0;i<10000;i++)
000A103D  xor         eax,eax  
000A103F  mov         dword ptr [esp+0Ch],eax  
000A1043  fld1  
000A1045  inc         eax  
		{
			x=(float) i;
000A1046  fild        dword ptr [esp+0Ch]  
000A104A  fstp        dword ptr [esp+0Ch]  

*****Y AND Z SET HERE*****

*****COMPUTATION OF A B C WAS HERE*****

*****LOOPING ON I WAS HERE*****

*****LOOPING ON J WAS HERE*****

Slow:
Code:
		for(i=0;i<10000;i++)
00C31037  xor         eax,eax  
00C31039  fxch        st(1)  
00C3103B  mov         dword ptr [esp+0Ch],eax  
00C3103F  jmp         main+45h (0C31045h)  
00C31041  fld1  
00C31043  fxch        st(1)  
		{
			x=(float) i;
00C31045  fild        dword ptr [esp+0Ch]  
00C31049  inc         eax  
00C3104A  fstp        dword ptr [esp+0Ch]  

*****Y AND Z SET HERE*****
			
			//__asm
			//{
			//	movaps xmm0,f1
			//}

			denominator= -uz*vy*wx + uy*vz*wx + uz*vx*wy - ux*vz*wy - uy*vx*wz + ux*vy*wz /*+ 3.1*/;

*****COMPUTATION OF A B C WAS HERE*****

*****LOOPING ON I WAS HERE*****

*****LOOPING ON J WAS HERE*****

In the fast version, the denominator is hoisted outside the loop, like so:
Code:
			denominator= -uz*vy*wx + uy*vz*wx + uz*vx*wy - ux*vz*wy - uy*vx*wz + ux*vy*wz + 3.1;
000A1035  fstp        dword ptr [esp+0Ch]  
000A1039  fld         dword ptr [esp+0Ch]  
	{

In the slow version, the denominator is never set...

So, I then decide to check your math:
Code:
denominator = -uz*vy*wx + uy*vz*wx + uz*vx*wy - ux*vz*wy - uy*vx*wz + ux*vy*wz
            = -7 *3 *-3 + 4 *6 *-3 + 7 *1 *5  - -1*6 *5  - 4 *1 *8  + -1*3 *8
            = 63        + -72      + 35       - -30      - 32       + -24
            = 0

Next time, choose a denominator that doesn't equal 0, sparky ;)
 
I was just picking arbitrary numbers for vx, uy, wz, etc... I assumed the chances of them making the denominator 0 were zero. My bad.

I guess I still don't quite understand why uncommenting the useless asm block speeds things up though. I don't get that at all.

And suppose I use a multiplication such as x*y multiple times. If optimization is on, will the compiler just multiply it once and use that result every time until either x or y changes? Is it better to let the compiler do stuff like that or should I do it myself?

Also, a question about volatile variables: Aren't all changes to volatile variables updated in the system RAM, not just registers or CPU cache? So if two threads on two different CPUs are both reading and altering the same volatile variable they can see what each other are doing? And if the variable weren't volatile then they might be reading and altering two separate copies of the variable which reside on each CPU's cache?
 
I was just picking arbitrary numbers for vx, uy, wz, etc... I assumed the chances of them making the denominator 0 were zero. My bad.
Assumption can bite you. Error checking helps :D

I guess I still don't quite understand why uncommenting the useless asm block speeds things up though. I don't get that at all.
I didn't bother checking or verifying if it actually does. By how much? Is it repeatable? Is it always faster?

And suppose I use a multiplication such as x*y multiple times. If optimization is on, will the compiler just multiply it once and use that result every time until either x or y changes? Is it better to let the compiler do stuff like that or should I do it myself?
It depends. What is your app goal? Most of the time you can rely on the compiler to handle it, then profile the code for performance. If the performance is acceptable, don't bother. If, however, you have critical timing goals and require max performance, analysis of the code can reveal any places where a lot of execution time is being spent. You can then examine and possibly optimize that code.

Also, a question about volatile variables: Aren't all changes to volatile variables updated in the system RAM, not just registers or CPU cache? So if two threads on two different CPUs are both reading and altering the same volatile variable they can see what each other are doing? And if the variable weren't volatile then they might be reading and altering two separate copies of the variable which reside on each CPU's cache?

I suggested using volatile because you desired to view the output of a variable that was not being used. Here's a good primer on volatile: http://www.embedded.com/story/OEG20010615S0107
 
I'm still really curious about what you're trying to benchmark. I'm also wondering why you would use an inline assembly block and yet you don't know how an optimizer works?
 
If you don't want the variable to exist only in registers, declare it "volatile".
This isn't what the "volatile" keyword does. Not exactly, anyway.
 
This isn't what the "volatile" keyword does. Not exactly, anyway.

It's one of the things it does, and it's close enough to keep Negative_3 from losing the variable due to optimization ;).

What, do you want me to write a mini-essay on "volatile" to explain it? I mean, I respect you and all, but that does seem quite the snarky response for a topic that was already solved.

Just in case, though...

Negative_3, volatile is a keyword that guarantees that any access to a particular memory reference always actually checks that reference for changes, because it is potentially subject to sudden and unpredictable changes that standard optimization would make faulty assumptions about. As a result, certain limitations are enforced upon the optimizer. One of the more important ones is a prohibition on reordering reads and writes to volatile-declared locations. This helps to ensure multithreaded code functions as intended after optimization.... blah blah blah I'm tired and going to bed now >.>
 