Multithreading test program

Scali2

Since people asked, I'm posting the test application here, which I made for testing the speed of my multithreaded MarchingCubes algorithm.
(Don't bother asking for source, I'm not at liberty to release it. This test-program is based on so much more than just some simple testing code, and its sole purpose is to examine performance on various systems and try to avoid any bottlenecks).

I've made a few different versions, one of them using shared memory, which is a good way to test the speed of inter-core communication. As we found out on another forum, the HT-speed on Athlon64 affects this version... which was a bit of a surprise, since I've always heard people say that AMD uses a fast internal crossbar for intercore-communication, which would be BEHIND the HT-logic, so not dependent on it. But the test results look more like the crossbar *is* the HT-logic itself, and the communication between cores is done in a similar way as a system with two physical sockets/CPUs.
Would be interesting to see how a real dual-CPU Opteron system performs on this code.

The other version tries to run each thread as independently as possible. Which turned out to be considerably faster, even on a system with a shared cache.
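To make the difference between the two approaches concrete, here is a minimal sketch. It is not the actual Fire code (that source is not available); the function names and the summing workload are made up for illustration. The first function has every thread update one shared value, the second lets each thread work on private data and only merges at the end. In CPython the global interpreter lock hides the real timing difference, so this only shows the structure, not the speed.

```python
import threading

def shared_sum(values, num_threads):
    """Hypothetical 'shared memory' style: all threads update one total."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        for v in chunk:
            with lock:  # every single update touches shared state
                total += v

    chunks = [values[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

def independent_sum(values, num_threads):
    """Hypothetical 'independent' style: private work, one merge at the end."""
    partials = [0] * num_threads

    def worker(index, chunk):
        local = 0  # thread-private accumulator
        for v in chunk:
            local += v
        partials[index] = local  # a single shared write per thread

    chunks = [values[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(i, c))
               for i, c in enumerate(chunks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)
```

Both functions return the same result; the second one simply generates far less traffic on shared data, which is the property the second test version is after.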

Anyway, here's the first version, which is nice for measuring core-communication:
http://scali.eu.org/~bohemiq/Fire.rar
Fire.exe is the old single-threaded version
Fire-Multithread.exe does quite a bit of shared memory processing
Fire-Multithread2.exe is a first version that avoids the shared memory as much as possible.

And this is a later version, where I optimized the second version even more, and also put in a control to choose the number of threads. This one also contains a 64-bit version. I've not yet found anyone who could run it for me on a 64-bit AMD system.
http://scali.eu.org/~bohemiq/FireNew.rar
 
I get an error message whenever I try to run the multithreaded versions: "Application failed to start because application configuration is incorrect. Reinstalling the application may fix the problem." The single-threaded version works fine.

WinXP SP2
Core 2 Duo E6600
 
So what exactly am I looking for in this test? My dual-core Opty runs it just fine. It says I'm averaging ~250 with a min of 200 in your second test, though I'm only getting ~90% usage out of it.

I've always understood that crossbar stuff to be an HT link between the two cores. I'm not the authority on this by any means; that was just my take on it.

I'll also guinea pig my P4 with H/T if you want to see those results.
 
So what exactly am I looking for in this test? My dual-core Opty runs it just fine. It says I'm averaging ~250 with a min of 200 in your second test, though I'm only getting ~90% usage out of it.

Not all code can run on both cores: only the actual volume algo can, not the surrounding code for animation/drawing. That should explain the 10% you're missing.
I'm currently only looking for results of multi-CPU systems, either Opteron or Xeon... and 64-bit results of Athlons/Opterons.
I have already gathered results for Core2, Pentium 4 HT, Pentium D and 32-bit Athlon and Athlon X2.
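A rough way to put numbers on that missing 10% is Amdahl's law. The sketch below assumes the serial part (animation/drawing) keeps exactly one core busy and the parallel part keeps all cores fully busy; that is a simplification, and the function names are just for illustration.

```python
def serial_fraction_from_utilization(utilization, cores):
    """Estimate the serial share of the single-threaded runtime.

    With serial share s, average utilization over all cores is
        U = 1 / (cores * s + (1 - s))
    which rearranges to s = (1/U - 1) / (cores - 1).
    """
    return (1.0 / utilization - 1.0) / (cores - 1)

def amdahl_speedup(serial_fraction, cores):
    """Best-case speedup over one core for a given serial share."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

s = serial_fraction_from_utilization(0.90, 2)
print(round(s, 3), round(amdahl_speedup(s, 2), 2))
```

For ~90% usage on two cores this gives a serial share of about 11% and a best-case speedup of about 1.8x over a single core, under the assumptions above.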

I've always understood that crossbar stuff to be an HT link between the two cores. I'm not the authority on this by any means; that was just my take on it.

Thing is that most review sites displayed schematics such as this one:
http://www.hothardware.com/viewarticle.aspx?page=2&articleid=767&cid=1

As you can see, they draw the crossbar between the cores and the actual HyperTransport interface. Because of this, it was common to believe that the cores themselves would talk directly via the crossbar, and the HT-link was only used for communication with the chipset and other CPUs in the system.
But if you lower the HT-link speed, you'll see that the multithreading performance drops as well (at least in this application, probably also in others, but I don't know anyone who's tried).
A lot of people seemed to think the HT-speed doesn't matter, and lowering it makes for better overclocking. Even some review sites presented such results. But apparently it does affect multithreaded software.
 
I don't know why, but your rar files are not Linux friendly. What version of rar are you using to create these archives? Do you think you could provide the files in a free format that does not use proprietary encoding, such as a gzipped or bzipped tar archive?

Second, without source, it is very hard to draw any conclusions from the results at all. Perhaps you could describe your implementation in a little more detail? What method are you using for shared memory (mmap(), System V, POSIX, threads with a shared address space)? How is your memory allocation done? Single pool, or NUMA compatible? The more information you provide, the more meaning your benchmark will have.

I've always understood that crossbar stuff to be an HT link between the two cores. I'm not the authority on this by any means; that was just my take on it.
which was a bit of a surprise, since I've always heard people say that AMD uses a fast internal crossbar for intercore-communication, which would be BEHIND the HT-logic, so not dependent on it. But the test results look more like the crossbar *is* the HT-logic itself, and the communication between cores is done in a similar way as a system with two physical sockets/CPUs.
It should be a surprise, because the cores (on a single die) do not need to touch HT to transfer information between them. The crossbar is totally behind the HT link logic. Your code must be doing something other than, or in addition to, what you expect. I've done some of my own tests, all of which support the fact that the crossbar is located behind the HT logic. You are free to take a look at the full source at any time here: http://www.cs.utk.edu/~vose/linux/NUMA.tar.gz.
 
I don't know why, but your rar files are not Linux friendly. What version of rar are you using to create these archives? Do you think you could provide the files in a free format that does not use proprietary encoding, such as a gzipped or bzipped tar archive?

They're Windows executables (both x86 and x64), so don't bother with Linux.

Second, without source, it is very hard to draw any conclusions from the results at all. Perhaps you could describe your implementation in a little more detail? What method are you using for shared memory (mmap(), System V, POSIX, threads with a shared address space)? How is your memory allocation done? Single pool, or NUMA compatible? The more information you provide, the more meaning your benchmark will have.

It's Windows, so all threads share the same address space (I use OpenMP to parallelize some for-loops).
There's no way to make it specifically NUMA-compatible since it is shared, hence both CPUs need to have read/write access. So there's no other way than to just allocate it on one CPU, and have the other CPU get extra latency for accessing...
The shared memory isn't that large though... it seems to fit pretty much entirely in L2-cache on CPUs with 1 MB or more. So what we are looking at is probably mostly inter-core communication, not memory access.
That's why I'd like to compare a dual Opteron to a dualcore system, see what happens.
 
Scali, are you implying that the CPU needs to go through the HT link to get to the memory?
 
Scali, are you implying that the CPU needs to go through the HT link to get to the memory?

Well, I'm mostly talking about cache, as I said in the previous post... Pretty much everything fits in L2-cache... yet we see the performance drop when we reduce the HT-speed in the BIOS settings.
So that would mean that it has to go through the HT-link to synchronize the caches.
That probably means memory accesses have to use the HT-link as well, even if only to keep the caches synchronized.
 
For kicks I'm trying to run this on my Dual Pentium 3 workstation (dual P3 800E's, GF4 ti4200) but none of the multithreaded tests will run.

The single-threaded ones run fine at about 35fps average, but all of the multithreaded versions give me the incredibly non-useful error message "Fire*.exe has generated errors and will be closed by windows. You will need to restart the program."

this is on win 2k sp4.


EDIT: oh okay, I noticed that Fire32-x87.exe can do any number of threads and that one runs fine. Oddly enough, 2 threads are actually slightly slower than 1 thread (not by much... just 1-2 fps). I guess I'm seeing the giant memory bandwidth bottleneck that PC133 RAM poses to a dual-CPU system. Do Fire32.exe and the original Fire-Multithread*.exe use SSE2 or SSE3 instructions? Maybe that's why it fails on the P3.
 
Yes, most of them use SSE2... that makes it fair to compare the 32-bit and 64-bit versions (you can't use x87 in x64 mode).
I guess the memory could be a problem. On a modern system most stuff fits in cache... but on a P3 the cache is probably too small, so it will rely on memory more... and that memory is very slow as well...
Else it could be the videocard. This test requires a card with decent hardware T&L, else the CPU has to do that as well, and you're no longer measuring just the algo itself.
On a modern system a GeForce 6600 or Radeon X600 or so should be fast enough to not affect the results. For a P3 you can use a slower card, but I suppose you'd still need something like a GeForce4 to make sure the CPU is the bottleneck, not the GPU.
 
yeah this system has a 128mb geforce 4 ti4200 in it, so the video card should be alright. I guess 256k of L2 cache just doesn't cut it :p
 
Yea, hard to say if the videocard is a problem.
I just remembered though... In the Fire32-x87 there should be a feature that when you press the 'm'-key, the CPU-algo stops, and only the same image is rotated and redrawn.
See what framerate you get when you do that.
If it is much higher, then the GPU is probably not a bottleneck. If the framerate barely goes up, then it's the GPU holding it back.
For example, on my system I get about 430 fps with the CPU enabled, and around 1300 fps when I disable it. So in my case the GPU is way faster.
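The same check can be done as frame-time arithmetic. This is only a sketch: it assumes the CPU algo and the drawing happen one after the other within a frame with no overlap, which may not hold exactly, and the function names are made up.

```python
def frame_time_ms(fps):
    """Time spent per frame, in milliseconds."""
    return 1000.0 / fps

def cpu_share(fps_cpu_on, fps_cpu_off):
    """Fraction of each frame attributed to the CPU algo.

    fps_cpu_on:  framerate with the algo running
    fps_cpu_off: framerate after pressing 'm' (redraw only)
    """
    total = frame_time_ms(fps_cpu_on)
    draw_only = frame_time_ms(fps_cpu_off)
    return (total - draw_only) / total

# Numbers from the post above: 430 fps with the CPU algo, 1300 fps without.
print(round(frame_time_ms(430), 2), round(frame_time_ms(1300), 2))
print(round(cpu_share(430, 1300), 2))
```

With 430 and 1300 fps the frame takes about 2.33 ms in total, of which about 0.77 ms is drawing, so roughly two thirds of the frame is the CPU algo. A share close to zero would mean the GPU side is what limits the framerate.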

Another thing... Athlon64-users reported that the x87-version runs faster for them than the SSE2-version. On Core2 and Pentium the SSE2-version is the faster one.
Makes me wonder how the 64-bit version runs on Athlons, since there's no way to create an x87-version there...
 
Around 280 FPS with Opteron 165 @ 2.79 GHz (FSB at 310 with HT Link multiplier at 3x) with the Fire-Multithread2.exe from Fire.rar. Running Fire32.exe I get around 385 FPS, and with Fire32-x87.exe I get about 390 FPS.
 
Alright here we go: Fire32-x87 on the Dual P3 800E's

1 thread: CPU on 38 fps
CPU off 346 fps

2 threads: CPU on 40 fps
CPU off 341 fps

After some additional testing it does seem that 2 threads is faster than 1, but by a negligible amount. The video card is definitely not the bottleneck.

Now I realize that these results are probably not useful to you at all, but I thought maybe you'd at least be curious to see how the program fared on an old dual-socket system.

EDIT: also on a side note, with 2 threads running I was seeing CPU usage for Fire32-x87.exe in the 70-75% range. For 1 thread it was a nice solid 47-49%. Looks like there is definitely something limiting the multithreaded case, as the second CPU is far from being fully loaded.
 
Well, it will be interesting to see how other multi-CPU systems compare...
I would expect that a 2-Xeon system performs about the same as a comparable dualcore.
With an Opteron system, there may be some differences when NUMA comes into play... but I suspect it will have little effect.
 
I just ran it on my C2D @ 3.4GHz on X64 using the default GPU settings with my lowly 7600GT and got the following

Min: 359
Max: 518
Average: 430 (but it is very hard to tell because it is bouncing all over the place)

I wish it had an FPS logging feature. Also it is only using about 75% of both cores.
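For what it's worth, min/max/average can be worked out from a list of frame timestamps, however those get captured. A minimal sketch follows; the timestamp list is hypothetical input, and nothing here reflects how Fire itself measures FPS. The average is computed as frames over elapsed time rather than as the mean of the instantaneous values, so a few very fast frames don't skew it.

```python
def fps_stats(timestamps):
    """Compute min/max/avg FPS from increasing frame times in seconds."""
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not intervals or min(intervals) <= 0:
        raise ValueError("need at least two strictly increasing timestamps")
    return {
        "min": 1.0 / max(intervals),   # slowest frame
        "max": 1.0 / min(intervals),   # fastest frame
        "avg": len(intervals) / (timestamps[-1] - timestamps[0]),
    }

# Hypothetical capture: two 2 ms frames followed by one 4 ms frame.
stats = fps_stats([0.0, 0.002, 0.004, 0.008])
print({k: round(v) for k, v in stats.items()})
```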
 
Ah, I finally got results from an AMD processor in 64-bit:
Dual-core Opteron 165 @ 2.25 GHz, 250 FSB (1000 HT) / GF 7900GTX in 64-bit Windows XP (with SP2 and latest hotfixes)

Fire.exe: ~170fps
Fire-Multithread.exe: ~180fps
Fire-Multithread2.exe: ~240fps
Fire32.exe: min: ~239fps avg: ~315fps max: ~322fps
Fire32-x87.exe: min: ~242fps avg: ~310fps max: ~349fps
Fire64.exe: min: ~288fps avg: ~300fps max: ~343fps

Apparently the minimum framerate is considerably better, but it can't reach higher absolute framerates. This makes me suspect that although in 64-bit the CPU does execute the code more quickly, the cache and possibly the chipset put a hard limit on the maximum speed, which is why it cannot get better results than 32-bit.
 