NF200 "True" 3-Way SLI Preliminary Results @ [H]

Right after you started posting videos, I started seeing other websites post video reviews, etc. about computer hardware. You are ahead of the curve, Kyle; keep those videos coming. Great job.
 
Then you get into semantics, just like we had with "true" quad-core CPUs. Either way, with this solution, every card has 16 PCIe 2.0 lanes for tri-SLI.

It's a case of "which way". IMO the proper counting should be CPU->GPU, not GPU<->GPU. Otherwise, what's going to stop companies from stacking NF200s and getting "true" x16 lanes to 16 cards off only a single x16 to the CPU? IMO we should count from the bottleneck. Granted, bandwidth isn't an issue right now.
 
All I can say is that I am not surprised by the results. I really hope that Intel releases X58s with Hydra sometime this year. We REALLY need near-100% scaling moving forward.
 
It's a case of "which way". IMO the proper counting should be CPU->GPU, not GPU<->GPU. Otherwise, what's going to stop companies from stacking NF200s and getting "true" x16 lanes to 16 cards off only a single x16 to the CPU? IMO we should count from the bottleneck. Granted, bandwidth isn't an issue right now.

We won't know how "true" NV's x16 lanes are until the 3x (x16 2.0) PCIe setup gets close to saturation. It does seem to stay even with x8 (hardware NF200 vs. software), for whatever that's worth. So far, nothing, it would seem.
 
It's a case of "which way". IMO the proper counting should be CPU->GPU, not GPU<->GPU. Otherwise, what's going to stop companies from stacking NF200s and getting "true" x16 lanes to 16 cards off only a single x16 to the CPU? IMO we should count from the bottleneck. Granted, bandwidth isn't an issue right now.

If you're going to worry about hanging an NF200 chip off the northbridge, then you'll probably want to worry about hanging a southbridge off the northbridge too. Take a look at Greencreek (5000X) and Seaburg (5400) and all the devices the ESB2 southbridge can have hanging off of it with only 2 GB/s (PCIe 2.0 x4) or 4 GB/s (PCIe 2.0 x8) bandwidth depending on how it's configured.
 
This doesn't surprise me at all.

Look at the Tesla line of compute GPUs, specifically the S1070: they are putting 2 GPUs onto an x8 PCIe bus and not seeing much in the way of slowing down computing.

First, the S1070 puts 2 Teslas on one x16 Gen 2 slot, not on one x8.

Second, your statement is application-dependent. Which CUDA application are you using as an example? I saw a slowdown relative to perfect scaling in my work with CUDA when going from 2x8800GTX (both at x16 Gen 1) to a single D870 (two cards sharing one x16 Gen 1). Adding timers to my code showed a definite increase in the amount of time required to transfer data to the cards. My app requires a bunch of transfers between CPU and GPU, so I was not surprised.

My point is you can't use one example from CUDA and apply it to graphics performance scaling. Hell, you can't even take one CUDA app and use it to estimate scaling for another CUDA app - there's just too much tied up in the specific way the app is coded.
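To make the transfer-bound case concrete, here's a toy back-of-the-envelope sketch (in Python, using a hypothetical 256 MB working set and nominal x16 Gen 1 bandwidth; it's a model of link sharing, not a benchmark of my app):

```python
# Toy model, not a benchmark: estimated host->device copy time when each GPU
# has its own x16 Gen 1 link versus when two GPUs share one x16 Gen 1 link.
def transfer_seconds(nbytes, link_gbs, gpus_sharing=1):
    """Copy time assuming the link's bandwidth is split evenly among the GPUs."""
    effective_gbs = link_gbs / gpus_sharing
    return nbytes / (effective_gbs * 1e9)

working_set = 256 * 1024**2  # hypothetical 256 MB pushed to each card per step

dedicated = transfer_seconds(working_set, link_gbs=4.0)                # x16 Gen 1 each
shared = transfer_seconds(working_set, link_gbs=4.0, gpus_sharing=2)   # one x16 Gen 1 for both

print(f"dedicated x16 Gen 1: {dedicated * 1e3:.0f} ms per GPU")
print(f"shared x16 Gen 1:    {shared * 1e3:.0f} ms per GPU")
```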
 
The stability of the board is what makes it valuable even in the face of just barely losing this benchmark, at least in my mind. It'll be interesting to see if newer drivers will make a difference here - GPU or perhaps mobo driver updates, as well as patches to games, could start taking advantage of the 3x x16 now that it's here.
 
Thanks for the info.
I was definitely curious about this.
 
There are two advantages of the NF200 promoted by NV. One is peer-to-peer writes, which is the "Broadcast" function; the other is "PW Short". The PCIe bus has now implemented native peer-to-peer writes (as of PCIe v2.0), which eliminates the need for the "Broadcast" functionality of the NF200.

This only leaves the "PW Short" function. This function reduces the communications over the FSB (QPI on X58) while reducing latencies for the data sent between the CPU(s) and multiple GPU(s). From the data we have seen so far (Kyle's comparison and the excellent SLI scaling seen with the X58 in other reviews), that would mean there is no bottleneck in the CPU(s) <-> GPU(s) communications. This is due to Nehalem and the X58's QPI interface with a bandwidth of 25.6 GB/s. Until video cards arrive that push the limits of the x16 PCIe 2.0 bus, we will probably never see any advantage from "PW Short".
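For rough context, these are the nominal peak numbers being thrown around in this thread (a quick sketch with textbook per-lane rates; real-world throughput sits well below these peaks):

```python
# Nominal peak link rates quoted in this thread (one direction unless noted).
# PCIe 2.0: 5 GT/s per lane with 8b/10b encoding -> 500 MB/s per lane per direction.
PCIE2_GBS_PER_LANE = 0.5

def pcie2_gbs(lanes):
    """Peak one-way bandwidth of a PCIe 2.0 link with the given lane count."""
    return lanes * PCIE2_GBS_PER_LANE

QPI_GBS_AGGREGATE = 25.6  # X58 QPI at 6.4 GT/s, both directions combined

print(f"PCIe 2.0 x16 slot : {pcie2_gbs(16):.1f} GB/s each way")
print(f"PCIe 2.0 x8 slot  : {pcie2_gbs(8):.1f} GB/s each way")
print(f"QPI (aggregate)   : {QPI_GBS_AGGREGATE:.1f} GB/s")
```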

Thanks for the information :) It seems that the only benefit the NF200 can offer at this point is increased bandwidth between the devices hooked up to it (two GPUs in this case). The benchmark seems to show a significant increase in latency (not talking about FPS here, but whether an effect is noticeable and repeatable), which would basically indicate that at this point the NF200 is pretty much useless and one hell of a hot chip.

I'm guessing that by the time x16, or even x8, Gen 2 bandwidth is getting saturated by devices, there'll be a Gen 3 version of PCIe. And chipsets may start adding more PCIe lanes.

As for the effect bassman saw with CUDA, it could be due to parallelism (two GPUs sharing the same lanes instead of having a dedicated link), which would introduce latency. My knowledge of the PCIe spec is too limited to say something definite on this, though. NVidia seems to think that the S1070 with its four GPUs will do fine with an x8 link or better, so bandwidth can't be the issue.

Ah well...
 
Keep in mind, guys, that system costs $2900 without a case or optical drives or any decent-sized hard drives.
Did a quick parts tally on Newegg... anyway, I would buy a car long before I bought all that.
 
Sorry, I should have made myself more clear. Three-way (x16/x16/x16) sounded like a gimmick to me, not the basic fact of using 3 cards.

Also, thanks Kyle for the speedy review; I too was curious as to what the results would be.
 
So they kludged together a platform that doesn't even have true 16/16/16 PCIe... nice one!

What is the point then?

What about running SLI on slots 2/3 versus slots 1/2?

And.. no PCI slots.... bleh.. stuck with onboard audio.

This doesn't appear to be an ASUS-exclusive thing; it appears to be a general Core i7 architecture thing.
 
So they kludged together a platform that doesn't even have true 16/16/16 PCIe... nice one!

What is the point then?

Have you paid attention to any nVidia-based boards in the past two years? Skulltrail? 7950GX2, 9800GX2 or GTX 295? How do you think they're getting the extra lanes there?
 
I don't see how you conclude that the NF200 is connecting cards 1 and 3. The illustration you provided (which I've seen from Tom's Hardware) is merely an example setup. Manufacturers could use any sort of configuration as to which actual PCIe ports go to the NF200 and which go to the X58 directly.

My original diagram was merely an example of how the NF200 could bottleneck a tri-SLI setup on the PCIe bus. Whether it actually bottlenecks it is uncertain.

Even still, if you have three expansion cards that use the kind of bandwidth you would expect from three PCIe x16 slots, you would see drastic underperformance from two of them, maybe even less than half due to latencies. I'm uncertain what devices could put that kind of bandwidth across the bus, but perhaps large-scale RAID controllers or fibre controllers could.



Then you get into semantics, just like we had with "true" quad-core CPUs. Either way, with this solution, every card has 16 PCIe 2.0 lanes for tri-SLI.



Your diagram is a little off: the NF200 goes to slots 1 and 3 (cards 1 and 2) and the X58 goes to slot 5 (card 3).

nforce_200_3_slot.png


nVidia has also stated that you can use 2 NF200 chips to provide 4 full-bandwidth x16 slots, although I haven't seen a board configured like this; maybe EVGA will surprise us with the FTW version.

nforce_200_4_slot.png
 
I don't see how you conclude that the NF200 is connecting cards 1 and 3. The illustration you provided (which I've seen from Tom's Hardware) is merely an example setup. Manufacturers could use any sort of configuration as to which actual PCIe ports go to the NF200 and which go to the X58 directly.

My original diagram was merely an example of how the NF200 could bottleneck a tri-SLI setup on the PCIe bus. Whether it actually bottlenecks it is uncertain.

Even still, if you have three expansion cards that use the kind of bandwidth you would expect from three PCIe x16 slots, you would see drastic underperformance from two of them, maybe even less than half due to latencies. I'm uncertain what devices could put that kind of bandwidth across the bus, but perhaps large-scale RAID controllers or fibre controllers could.

With this board, it appears the nForce 200 chip is connected to slots 1-4. With slots 1 and 3 populated, you have x16 to both slots. With slots 2 and 4 also populated, those 4 slots get 8 lanes each, which is the 32 lanes that the NF200 provides. Slot 5 is always at 16 lanes and slot 6 is always at 4 lanes; these are the remaining 20 lanes from the X58.

If the NF200 does bottleneck a tri-SLI setup, then it would also bottleneck it on any 780i board, since it's set up the same way.

780i-sli-block.png
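Treating that layout as given, the lane split works out like this (a small sketch based on the slot-to-chip wiring described in this post, not on any official block diagram):

```python
# Lane split for the layout described above: the NF200 feeds slots 1-4 from its
# 32 downstream lanes, while the X58's remaining 20 lanes feed slot 5 (x16) and
# slot 6 (x4). This is a simplification; real boards switch lanes in x16/x8 steps.
def slot_lanes(populated):
    """Return {slot: lanes} for the populated slots under this layout."""
    lanes = {}
    nf200_slots = [s for s in (1, 2, 3, 4) if s in populated]
    for s in nf200_slots:
        # 32 NF200 lanes shared across the populated slots, capped at x16 per slot
        lanes[s] = min(16, 32 // len(nf200_slots))
    if 5 in populated:
        lanes[5] = 16  # always x16, straight off the X58
    if 6 in populated:
        lanes[6] = 4   # always x4, the X58's leftover lanes
    return lanes

print(slot_lanes({1, 3, 5}))     # tri-SLI: {1: 16, 3: 16, 5: 16}
print(slot_lanes({1, 2, 3, 4}))  # all four NF200 slots populated: x8 each
```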
 
First, the S1070 puts 2 Teslas on one x16 Gen 2 slot, not on one x8.

Second, your statement is application-dependent. Which CUDA application are you using as an example? I saw a slowdown relative to perfect scaling in my work with CUDA when going from 2x8800GTX (both at x16 Gen 1) to a single D870 (two cards sharing one x16 Gen 1). Adding timers to my code showed a definite increase in the amount of time required to transfer data to the cards. My app requires a bunch of transfers between CPU and GPU, so I was not surprised.

My point is you can't use one example from CUDA and apply it to graphics performance scaling. Hell, you can't even take one CUDA app and use it to estimate scaling for another CUDA app - there's just too much tied up in the specific way the app is coded.

I do understand that my statement was VERY vague. My point was that there really isn't "that much" of a performance hit, with "that much" obviously being subjective to the task at hand.

The underlying point is that there is a lot of bandwidth available, and in non-gaming situations like the S1070, companies like NVIDIA aren't completely worried about oversaturating the bus. I believe that they were even planning on bringing out an S chassis with all four cards over a single x16 host card.
 
Right after you started posting videos, I started seeing other websites post video reviews, etc. about computer hardware. You are ahead of the curve, Kyle; keep those videos coming. Great job.

No offense to Kyle, but do you actually think he was the first to do video reviews? Certainly not the first of any type, though I haven't seen many videos that are mini-articles of original testing. Personally I prefer text articles; they seem like a lot less work for Kyle, use less bandwidth, and can contain more information that is easier to absorb. I don't need to hear 'by 16, by 16, by 16' a bunch of times :p It's the same as other video reviews: just give me the info so I can scan for what I want to know rather than giving me 5 minutes of unindexed info.

I'd prefer to see text along with video but it's not so bad for these 'mini articles' and video does suit some things like visual hardware previews.
 
With this board, it appears the nForce 200 chip is connected to slots 1-4. With slots 1 and 3 populated, you have x16 to both slots. With slots 2 and 4 also populated, those 4 slots get 8 lanes each, which is the 32 lanes that the NF200 provides. Slot 5 is always at 16 lanes and slot 6 is always at 4 lanes; these are the remaining 20 lanes from the X58.

If the NF200 does bottleneck a tri-SLI setup, then it would also bottleneck it on any 780i board, since it's set up the same way.

780i-sli-block.png

What's your source on this X58 lane config? Maybe I missed it somewhere, but I don't recall it being that specific...

On a related note, I've seen quite a few stability issues with the NF7 series chipsets :p
 
What's your source on this X58 lane config? Maybe I missed it somewhere, but I don't recall it being that specific...

On a related note, I've seen quite a few stability issues with the NF7 series chipsets :p

Check the other 2 threads about this board.

Also, Skulltrail and some Precision workstations use nForce 100 chips to multiplex lanes for SLI too; they're both pretty stable 2P solutions. ;)
 
No offense to Kyle, but do you actually think he was the first to do video reviews? Certainly not the first of any type, though I haven't seen many videos that are mini-articles of original testing. Personally I prefer text articles; they seem like a lot less work for Kyle, use less bandwidth, and can contain more information that is easier to absorb. I don't need to hear 'by 16, by 16, by 16' a bunch of times :p It's the same as other video reviews: just give me the info so I can scan for what I want to know rather than giving me 5 minutes of unindexed info.

I'd prefer to see text along with video but it's not so bad for these 'mini articles' and video does suit some things like visual hardware previews.

That is why we included the picture. That is all folks like you needed or wanted.
 
Have you tried the same apples-to-apples comparison for dual SLI (or single card for that matter) at all? I'm curious if you get the same 1-2% difference across all configurations.

Love these videos though. The no-nonsense, "here's exactly what we did, here's what we saw" style is spot-on and refreshingly gimmick-free.
 
I do understand that my statement was VERY vague. My point was that there really isn't "that much" of a performance hit, with "that much" obviously being subjective to the task at hand.

My point is your broad statement about multi-GPU Tesla doesn't apply to multi-GPU graphics. "That much" was 15% in my case. Coincidentally, there's a thread in the CUDA forums right now about an NV employee testing an app on a 4-GPU setup where "doubling the PCIe bandwidth gave him a 30% performance increase."

http://forums.nvidia.com/index.php?s=&showtopic=85787&view=findpost&p=487250

So, many apps DO see significant gains with more PCIe bandwidth. It just appears that regular, old graphics rendering doesn't.

The underlying point is that there is a lot of bandwidth available, and in non-gaming situations like the S1070, companies like NVIDIA aren't completely worried about oversaturating the bus. I believe that they were even planning on bringing out an S chassis with all four cards over a single x16 host card.

NV knows there are apps that require a ton of PCIe bandwidth, apps that are bound by on-GPU memory bandwidth, and apps that are bound by raw multiprocessor speed. For people who need lots of PCIe bandwidth (like me), you buy individual C1060 cards and a motherboard with lots of PCIe lanes. For an app that is compute-bound, you can save money by getting a motherboard with one x16 connector and the S1075 you referred to.

Just because they're planning a product to address one of those groups doesn't mean they aren't worried about the others. In fact, in the post linked above, Tim Murray pointed out they are already very close to saturating it now. I can get 5+ GB/s (out of the theoretical 8 GB/s max) to each card in my setup, and I need it.
 
Anybody else having a problem viewing the video? The last few times I've viewed the site, it's always come up with "Video Not Available".
 
Anybody else having a problem viewing the video? The last few times I've viewed the site, it's always come up with "Video Not Available".

Yeah, ever since I first noticed the article was up, I haven't been able to view it. Did you guys pull the vid or something?
 
Have you tried the same apples-to-apples comparison for dual SLI (or single card for that matter) at all? I'm curious if you get the same 1-2% difference across all configurations.

Love these videos though. The no-nonsense, "here's exactly what we did, here's what we saw" style is spot-on and refreshingly gimmick-free.

No, we have not done that. Thanks for the kind words.

Anybody else having a problem viewing the video? The last few times I've viewed the site, it's always come up with "Video Not Available".

It will not work for me in Chrome or FF, but works in IE every time. Dunno what is up at Viddler. The company's uptime is sketchy, and that is the reason we have not contracted with it yet.
 
What kind of power supply is needed to run all that? O_O

Got a small geothermal plant nearby?

WTFV! :D It was a Thermaltake ToughPower 1200W.

Kyle, really enjoying these videos; they bring something different to the table, and it's nice to get a really good idea of what the hardware is like.

I had a feeling something big was afoot, reading between the lines on the site/forum, with the NF200 and "true" 3-way SLi these past few days.

So it's essentially worthless; a bit disappointing, but hey-ho.
 
My point is your broad statement about multi-GPU Tesla doesn't apply to multi-GPU graphics. "That much" was 15% in my case. Coincidentally, there's a thread in the CUDA forums right now about an NV employee testing an app on a 4-GPU setup where "doubling the PCIe bandwidth gave him a 30% performance increase."

http://forums.nvidia.com/index.php?s=&showtopic=85787&view=findpost&p=487250

So, many apps DO see significant gains with more PCIe bandwidth. It just appears that regular, old graphics rendering doesn't.

Sure, if you need to cram the 4 GB of VRAM on a Tesla card full of data, then yes, more bandwidth will decrease the total time spent on the process. But if this is your primary concern, and it actually significantly impacts your application, you should look at options like streaming in the data or using chunks. If games were to pre-load every single texture and piece of data they might possibly use in the entire game, then load times would be insane too. Fortunately, games don't do this.
 
Alright, to make it simple:

Yes, an NF200 chip allows two of the three cards to talk to each other at full speed. It severely cripples the combined bandwidth from those two cards to the chipset, though.

X58 native: One x16 2.0 connection plus two x8 2.0 connections. 8 GB/s to the x16 connection, 4 GB/s each to the x8 connections. Direct bandwidth between cards and chipset.

X58+NF200: One x16 2.0 connection, plus two bridged x16 connections. The single x16 2.0 connection has 8 GB/s direct to the chipset. However, the NF200 chip is only a PCI Express 1.0 device. This means that its connection to the chipset is x16 1.0, or only 4 GB/s. Its connection to the two video cards is the full x16 2.0, though. So you've got 8 GB/s between the two cards, but then only 4 GB/s total for both cards to the chipset.

I don't know about you, but I'd rather have two cards at 4 GB/s each, than a pair of cards at 4 GB/s total; even if those two can talk to each other at 8 GB/s.

This also brings up a question: does the NF200 chip *ALWAYS* split the one x16 2.0 link? I.e., if I only plug in two cards, will one connect at x16 2.0 and the other to the NF200 at x16 2.0, but then the NF200 will chop that down to x16 1.0? A 'base' X58 that autoswitches to x8 only when a third card is inserted would be better here, because then you'd get your full 8 GB/s for two cards.
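Whether the NF200's uplink really runs at 1.0 signalling is questioned below, but the per-link arithmetic itself is easy to check (a quick sketch using the standard per-lane rates):

```python
# Peak one-way PCIe bandwidth per link: 250 MB/s per lane for 1.0/1.1
# (2.5 GT/s, 8b/10b) and 500 MB/s per lane for 2.0 (5 GT/s, 8b/10b).
GBS_PER_LANE = {"1.0": 0.25, "2.0": 0.5}

def link_gbs(gen, lanes):
    return GBS_PER_LANE[gen] * lanes

for gen, lanes in [("2.0", 16), ("2.0", 8), ("1.0", 16)]:
    print(f"PCIe {gen} x{lanes:<2}: {link_gbs(gen, lanes):.1f} GB/s each way")
# x16 2.0 -> 8 GB/s, x8 2.0 -> 4 GB/s, x16 1.0 -> 4 GB/s
```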
 
Thanks for the info, keep the videos coming; always good to see hardware in "motion" ;)
 
Alright, to make it simple:

Yes, an NF200 chip allows two of the three cards to talk to each other at full speed. It severely cripples the combined bandwidth from those two cards to the chipset, though.

X58 native: One x16 2.0 connection plus two x8 2.0 connections. 8 GB/s to the x16 connection, 4 GB/s each to the x8 connections. Direct bandwidth between cards and chipset.

X58+NF200: One x16 2.0 connection, plus two bridged x16 connections. The single x16 2.0 connection has 8 GB/s direct to the chipset. However, the NF200 chip is only a PCI Express 1.0 device. This means that its connection to the chipset is x16 1.0, or only 4 GB/s. Its connection to the two video cards is the full x16 2.0, though. So you've got 8 GB/s between the two cards, but then only 4 GB/s total for both cards to the chipset.

I don't know about you, but I'd rather have two cards at 4 GB/s each, than a pair of cards at 4 GB/s total; even if those two can talk to each other at 8 GB/s.

This also brings up a question: does the NF200 chip *ALWAYS* split the one x16 2.0 link? I.e., if I only plug in two cards, will one connect at x16 2.0 and the other to the NF200 at x16 2.0, but then the NF200 will chop that down to x16 1.0? A 'base' X58 that autoswitches to x8 only when a third card is inserted would be better here, because then you'd get your full 8 GB/s for two cards.

NF200 is PCIe 2.0; NF100 was PCIe 1.1. It's 8 GB/s to the NF200 chip and from there 8 GB/s to each of the two cards. If you're using all 4 slots, then it's 4 GB/s to each slot.

If you look at Everest with a 9800GX2, you'll see each of the cards in the 9800GX2 has a full PCIe 2.0 x16 connection because of the NF200 chip on the card. Same thing applies here.
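Taking that description at face value, each card behind the NF200 gets a full x16 2.0 link to the chip, while everything headed to the chipset funnels through the NF200's single x16 2.0 uplink. A minimal sketch of the worst case (every card transferring to the chipset at once; the 8 GB/s figures are the nominal one-way peaks from above):

```python
# Worst-case chipset bandwidth per card behind the NF200, assuming the
# description above: x16 2.0 from the NF200 to each card, and a single
# x16 2.0 uplink (8 GB/s each way) from the NF200 to the X58.
UPLINK_GBS = 8.0    # NF200 <-> X58, one direction
PER_CARD_GBS = 8.0  # NF200 <-> each card, one direction

def chipset_gbs_per_card(cards_behind_nf200):
    """Per-card chipset bandwidth if every card transfers simultaneously."""
    return min(PER_CARD_GBS, UPLINK_GBS / cards_behind_nf200)

for n in (1, 2, 4):
    print(f"{n} card(s) behind NF200: {chipset_gbs_per_card(n):.1f} GB/s each to the chipset")
```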
 
Sure, if you need to cram the 4 GB of VRAM on a Tesla card full of data, then yes, more bandwidth will decrease the total time spent on the process. But if this is your primary concern, and it actually significantly impacts your application, you should look at options like streaming in the data or using chunks.

That's an oversimplification of a single case, and isn't the only case where PCIe bandwidth is important. Even streaming and chunking could benefit from more PCIe bandwidth depending on the application and amount of data.

My point still stands: claiming that two Tesla GPUs on an x8 connector don't "slow down computing" is a broad statement and not always true. The implication that Tesla overall is not bound by PCIe bandwidth is erroneous. Finally, even if the claim were correct about Tesla and bandwidth, applying it to back up results in a graphics performance test is a fallacy. The scenarios are too different.
 
I may still go with the P6T6 WS board, simply because I want the ability to run 2-way SLI and at the same time use my x8 RAID adapter. This is where having the extra lanes is a good thing. I've been having trouble finding a board with the necessary slots available in the correct physical locations (so that it all fits) while letting me use 2 smaller x1 cards (5 PCIe slots needed). I still wish the X58 had been designed with more lanes, so that mobo manufacturers would have the freedom to design a board with 6 PCIe slots.

Yeap...that's pretty much where I am as well. The layouts of the other boards preclude me using them for my planned build.
 
I made a drinking game out of the video. Every time you hear "by 16", you drink! Let the game begin!
 
Yeap...that's pretty much where I am as well. The layouts of the other boards preclude me using them for my planned build.

So, is there any latency involved when using 2-way SLi?

I have a PCIe x1 X-Fi Ti, and obviously 2 dual-slot GPUs if I go SLi... what would be the best slots to use for this? And again, would I be bypassing any latency of the NF200?

So...
Slot 1 - GPU 1
Slot 2 - Empty
Slot 3 - GPU 2
Slot 4 - Empty
Slot 5 - Empty
Slot 6 - X-Fi Ti

Would the NF200 even be used in this config?
 
So, is there any latency involved when using 2-way SLi?

I have a PCIe x1 X-Fi Ti, and obviously 2 dual-slot GPUs if I go SLi... what would be the best slots to use for this? And again, would I be bypassing any latency of the NF200?

So...
Slot 1 - GPU 1
Slot 2 - Empty
Slot 3 - GPU 2
Slot 4 - Empty
Slot 5 - Empty
Slot 6 - X-Fi Ti

Would the NF200 even be used in this config?

Slots 1 and 3 are both from NF200.
 
I think it'd be interesting to run a test of 2x16 SLI vs 2x8 SLI, to see how big of a difference it makes in a two-card situation, similar to how this is a 16+2x16 vs 16+8+8 situation. It would also be a baseline predictor for Lynnfield SLI setups.
 