Possible explanation (excuse) for horrid single-threaded BD performance... read on!

Even still, most apps are going multi-threaded, and most over the past year have been. Too little, too late?
 
Not sure what to make of that. Someone explain?

Bulldozer is basically four dual-core modules, each pair of cores sharing the same cache. Supposedly this is not an optimal design and causes tug-of-war "thrashing" between the cores. The poster disabled one core on each module, making it a 4-core CPU where each core has its own "dedicated" cache. The results are represented by the red bar in the graphs, and as you can see there are improvements to be made... Software (i.e. OS) and/or BIOS fixes could address this issue. I hope.. :)

[attached benchmark graph]
 
Hmmm, interesting.

Thrashing between L1 and L2 cache across modules?

Bulldozer is basically four dual-core modules, each pair of cores sharing the same cache. Supposedly this is not an optimal design and causes tug-of-war "thrashing" between the cores. The poster disabled one core on each module, making it a 4-core CPU where each core has its own "dedicated" cache. The results are represented by the red bar in the graphs, and as you can see there are improvements to be made... Software (i.e. OS) and/or BIOS fixes could address this issue. I hope.. :)

With UEFI and such I imagine it could be done.
 
I doubt even then it could compete with a 2500K, let alone with itself. It's fighting itself and Thuban.
 
I would think AMD knew all about this, and if they wanted to, they could have released microcode to "fix" this issue (i.e., turn the "8-core" CPU into a "4-core" CPU), so there is a plausible explanation someplace in AMD's R&D department... or maybe the marketing guys yelled "too late, we already said it has 8 cores!"
 
Whooaaaa... somebody discovered that in dual-core CPUs sharing a cache, disabling a core improves performance!! I could have told him that without bothering to test it...

Take any Core 2, disable one of the cores, and compare single-threaded perf when both cores are active vs. only one active. Who will be surprised at the result?
 
If possible:

1. AMD OverDrive/BIOS/UEFI/firmware.
2. Dynamically monitor the system load. If you want, you can use the same PROFILE idea as well.
3. Check the graph: for 4 modules (8 int cores + 4 shared FP units), the score is about 11K, which averages (divide by 8) roughly 1,400 per core.
4. For 4 modules (4 int cores + 4 FP units, meaning one int core disabled per module), the score is 8,800+, an average of 2,200 per core.
5. Thus, under a light load with acceptable multi-threading, you will be much better off.
5.1 Activate AMD OverDrive: CoreDormant function calls to disable cores 2, 4, 6, and 8 and cut power to the absolute minimum for those core areas.
5.2 Activate AMD OverDrive: SwitchPowerProfile, now to ramp the clock on the remaining 4 int cores + 4 256-bit FP units across the 4 modules.

6. The Bulldozer design philosophy is very comprehensive.
6.1 If you believe "most programs use two threads", deactivate 2 modules, switch the other 2 modules to 1 int core + 1 FP unit each, and ramp the clock. This is Pentium Emulation Mode.

6.2 If you must break records, deactivate 3 modules, switch the last module to 1 int core + 1 FP unit, and ramp the clock. This is NetBurst Celeron World Record Mode: 8 GHz.

The 4-module Bulldozer illustrates the point: it can be reconfigured to mimic an Athlon II, Phenom II, P4, C2D/C2Q (65/45nm), or Core i3/i5 to meet any need.

6.3 You can dynamically switch to any functional MODE when the situation demands it. In the most complete sense, you can re-program the entire OverDrive/UEFI/BIOS/firmware stack.

So the 4-module part is actually extremely flexible (a software-only approximation is sketched below). However, some, as the graphs clearly show, may prefer the FX-6100 as an intermediate choice that addresses all the common issues just fine.
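If the BIOS/OverDrive route isn't available, the same "one core per module" idea can be approximated from plain software with processor affinity. Here's a minimal Win32 sketch; it assumes (and this is only an assumption, nothing in this thread confirms it) that Windows enumerates the two cores of each module as adjacent logical CPUs, so mask 0x55 keeps cores 0/2/4/6, one per module:

```c
/* affinity_sketch.c -- confine this process to one core per module.
 * ASSUMPTION: logical CPUs 0/1 = module 0, 2/3 = module 1, etc.,
 * so mask 0x55 (binary 01010101) keeps one core per module. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR one_core_per_module = 0x55;

    if (!SetProcessAffinityMask(GetCurrentProcess(), one_core_per_module)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n",
                (unsigned long)GetLastError());
        return 1;
    }
    printf("Confined to cores 0/2/4/6 -- one per module.\n");
    /* ...launch or run the single-threaded workload here... */
    return 0;
}
```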

Another fine point is that SSE4.1/SSE4.2 are finally available. Software support may take a while if old software only checks the CPU type and not the actual SSE feature bits. Phenom II and older generations only support up to SSE4a, which is a minimal subset compared to what you have had in C2D/C2Q/Core i3/i5/i7 all along.
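For what it's worth, checking the feature bit instead of the CPU brand is trivial with CPUID. A small sketch using the MSVC intrinsic; the bit positions (leaf 1, ECX bit 19 = SSE4.1, bit 20 = SSE4.2) are standard, the rest is illustrative:

```c
/* sse_detect.c -- detect SSE4.1/4.2 by feature bit, not vendor string.
 * CPUID leaf 1: ECX bit 19 = SSE4.1, bit 20 = SSE4.2. */
#include <intrin.h>
#include <stdio.h>

int main(void)
{
    int regs[4];                        /* EAX, EBX, ECX, EDX */
    __cpuid(regs, 1);
    int sse41 = (regs[2] >> 19) & 1;
    int sse42 = (regs[2] >> 20) & 1;
    printf("SSE4.1: %s, SSE4.2: %s\n",
           sse41 ? "yes" : "no", sse42 ? "yes" : "no");
    return 0;
}
```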
 
I doubt even then it could compete with a 2500K, let alone with itself. It's fighting itself and Thuban.

Dunno, I'd like to see Kyle and crew try it with cores/Bulldozer modules disabled and a real mild overclock, say 5-6 GHz, and see what happens.

Edit: I think we might also see a performance hit due to the way Windows 7 does task scheduling, since each Bulldozer module is two cores with a shared L2. Say a thread is on core 0 of module 1 and needs to communicate with a thread on another module:
the scheduler can reschedule that thread, moving the stack frame and everything else over to core 1 of module 1. The problem is that the two cores in each of the four Bulldozer modules share their L2 cache. This differs from the other architectures, where every core has its own L2.
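To make that concrete, here's a rough sketch of the workaround an application could do today: pin two cooperating threads onto the two cores of a single module so the data they share stays in that module's L2. The core numbering (logical CPUs 0 and 1 = module 0) is an assumption, not something documented here:

```c
/* module_pin.c -- keep a pair of cooperating threads inside one module,
 * so the data they share stays in that module's L2.
 * ASSUMPTION: logical CPUs 0 and 1 are the two cores of module 0. */
#include <windows.h>
#include <process.h>

static unsigned __stdcall worker(void *arg)
{
    DWORD_PTR core_mask = (DWORD_PTR)arg;
    /* Pin this thread to a single core inside the module. */
    SetThreadAffinityMask(GetCurrentThread(), core_mask);
    /* ...do this thread's half of the shared-data work... */
    return 0;
}

int main(void)
{
    HANDLE t[2];
    /* Masks 0x1 and 0x2 = cores 0 and 1, the two halves of module 0. */
    t[0] = (HANDLE)_beginthreadex(NULL, 0, worker, (void *)0x1, 0, NULL);
    t[1] = (HANDLE)_beginthreadex(NULL, 0, worker, (void *)0x2, 0, NULL);
    WaitForMultipleObjects(2, t, TRUE, INFINITE);
    CloseHandle(t[0]);
    CloseHandle(t[1]);
    return 0;
}
```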
 
Without sharing resources between integer cores, it gains ~20% performance per core in the 4-module/4-core configuration compared to the 2-module/4-core one.

However, it's still shitty because it starts with a 20% per-core IPC deficit compared to Phenom II already.
So, at best, it matches K10.5.
But SB still spanks it soundly. The excuse is not good enough.

AMD had 2 years to optimize FX for Windows 7.
Blaming Windows 7 for being unoptimized is just a scapegoat tactic.
 
Without sharing resources between integer cores, it gains ~20% performance per core in the 4-module/4-core configuration compared to the 2-module/4-core one.

However, it's still shitty because it starts with a 20% per-core IPC deficit compared to Phenom II already.
So, at best, it matches K10.5.
But SB still spanks it soundly. The excuse is not good enough.

AMD had 2 years to optimize FX for Windows 7.
Blaming Windows 7 for being unoptimized is just a scapegoat tactic.


I don't think anyone's blaming Windows 7 for it. It just means there needs to be an update so that Windows 7 can use Bulldozer better, which I'm sure we'll see in the next week or so as an optional AMD processor driver update. Not the first time it's happened, and it won't be the last.
 
But SB still spanks it soundly. The excuse is not good enough.

AMD had 2 years to optimize FX for Windows 7.
Blaming Windows 7 for being unoptimized is just a scapegoat tactic.

I'm not trying to be an apologist; I don't root for any particular team. Hell, my next system is RISC-based (c'mon November!). The architecture of this processor is quite different, though. I'm not blaming Windows 7, I'm just glad 8 will have better thread scheduling in general.

But it is sad that the Phenom II 9xx BE is on par with the FX-8150 in single-threaded performance in its default configuration of 4 modules/8 cores.
 
Without sharing resources between integer cores, it gains ~20% performance per core in the 4-module/4-core configuration compared to the 2-module/4-core one.

However, it's still shitty because it starts with a 20% per-core IPC deficit compared to Phenom II already.
So, at best, it matches K10.5.
But SB still spanks it soundly. The excuse is not good enough.

AMD had 2 years to optimize FX for Windows 7.
Blaming Windows 7 for being unoptimized is just a scapegoat tactic.

Dunno what benchmarks you're reading, but SB doesn't spank it soundly.

http://www.tomshardware.com/reviews/fx-8150-zambezi-bulldozer-990fx,3043-15.html

http://www.anandtech.com/show/4955/the-bulldozer-review-amd-fx8150-tested/10

In a lot of tests it's faster than the i5 2500K.
 
The IFs:
IF there's some kind of driver so that Windows 7 only sees 4 threads, with the driver managing scheduling within each module's cores internally;
IF there's some kind of driver that optimizes Windows 7's scheduling;
IF the Windows 8 preview somehow already has optimized scheduling, it could be used for benchmarks, maybe better gaming with BD?

The last IF has the highest chance of happening; the first two require too much effort from developers and are unlikely to happen.
 
I don't think anyone's blaming Windows 7 for it. It just means there needs to be an update so that Windows 7 can use Bulldozer better, which I'm sure we'll see in the next week or so as an optional AMD processor driver update. Not the first time it's happened, and it won't be the last.

I don't get why everyone is thinking this will be the holy grail for AMD. If this patch were so much better for BD performance, AMD would have delayed the launch by another week. All the reviews are on the web now showing how bad BD is. There is no way a simple patch is going to recover them from this. Now it's time to wait for the Piledriver threads about how it will be better and everything will be good again.
 
That makes it a 2-billion-transistor, 315 mm^2, 4-core part that competes with 900-million-transistor, roughly 200 mm^2 Sandy Bridge quad cores.

And SB has GPU built into it.

And BTW, isn't Bulldozer doing this automatically? I kind of assumed that if it's loaded with 4 threads, they would be sent to full cores first?
 
the Piledriver threads about how it will be better and everything will be good again.

AMD has lost a lot of enthusiasts with false promises since the Barcelona release. This is the second time AMD has been a complete letdown. The performance is mediocre in multithreaded benchmarks, and the power consumption is just wow. Also, needing 2 billion transistors to achieve this result doesn't leave much room for faith that they'll be able to fix it.

But who knows, maybe they can pull off what nVidia pulled off post-Fermi.
 
Without sharing resources between integer cores, it gains ~20% performance per core in the 4-module/4-core configuration compared to the 2-module/4-core one.

However, it's still shitty because it starts with a 20% per-core IPC deficit compared to Phenom II already.
So, at best, it matches K10.5.
But SB still spanks it soundly. The excuse is not good enough.

AMD had 2 years to optimize FX for Windows 7.
Blaming Windows 7 for being unoptimized is just a scapegoat tactic.

It wouldn't surprise me if they optimized for Linux (I'm still waiting to find someone posting virtualization benchmarks, Apache benchmarks, etc.), since this same arch will be used in C32 & G34 parts, where the $$$$ is much, much better than on the desktop.
 
That link to XS is down, anyone have another one?

You're in luck. I saved the poster's comments and pics....;)

AMD FX "Bulldozer" Review - (4) !exclusive! Excuse for 1-Threaded Perf.

What I'm about to deal with here is comparing 2CU/4C and 4CU/4C Bulldozers.
(CU stands for Compute Unit, or equivalently "Module".)
It can be an excuse for Bulldozer's initially poor single-threaded performance:
benchmark tools are simply under-optimized for this kind of architecture, and the performance isn't inherently "poor" once some optimization is done.


[attached: CPU-Z screenshots of the 2-module/4-core and 4-module/4-core configurations, plus benchmark graphs for Fritz Chess, wPrime, WinRAR, 3DMark06, 3DMark Vantage, 3DMark 11, Cinebench R10, Cinebench R11.5, Blender, x264, and PotEncoder]


Again, the red bar represents a Bulldozer with one core in each module disabled, making it a quad-core CPU where each core has its own (not shared) cache...
 
I have a question.

Do we know this is really the case, and that the system isn't aware of which cores are full cores and which ones are half cores?

Because if we have a 4-threaded load and we disable 2 modules, the OS must put it on 2 full cores and 2 half cores. If we instead disable all the half-cores in the BIOS, all the threads go to non-shared resources.
But are we really sure that with all 8 cores active the system will send threads to both types of cores?

It should really be basic preparation for AMD, with such a huge architecture change, to make sure Windows is aware of which cores are full cores.

Could someone who has the CPU run some kind of benchmark that can be set to a specific number of threads and test all 3 scenarios?
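Something like the bare-bones harness below would do it: thread count and affinity mask come from the command line, so one binary covers all three scenarios (all 8 cores, mask 0xFF; one core per module, mask 0x55; two full modules, mask 0x0F). The spin loop is a placeholder workload, and the module-to-core mapping in those masks is an assumption, not verified:

```c
/* bd_harness.c -- run N spin threads under a given affinity mask.
 * Usage: bd_harness <threads> <mask>, e.g.
 *   bd_harness 4 0xFF   all 8 cores visible
 *   bd_harness 4 0x55   one core per module (assumed mapping)
 *   bd_harness 4 0x0F   two full modules (assumed mapping)
 * The busy loop is a stand-in for a real benchmark kernel. */
#include <windows.h>
#include <process.h>
#include <stdio.h>
#include <stdlib.h>

static unsigned __stdcall spin(void *arg)
{
    volatile unsigned long acc = 0;
    unsigned long i;
    (void)arg;
    for (i = 0; i < 200000000UL; i++)
        acc += i;                       /* placeholder integer work */
    return 0;
}

int main(int argc, char **argv)
{
    HANDLE t[64];
    int i, n;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <threads (1-64)> <mask>\n", argv[0]);
        return 1;
    }
    n = atoi(argv[1]);
    if (n < 1 || n > 64)
        return 1;
    /* strtoul with base 0 accepts hex masks like 0x55. */
    SetProcessAffinityMask(GetCurrentProcess(),
                           (DWORD_PTR)strtoul(argv[2], NULL, 0));

    DWORD start = GetTickCount();
    for (i = 0; i < n; i++)
        t[i] = (HANDLE)_beginthreadex(NULL, 0, spin, NULL, 0, NULL);
    WaitForMultipleObjects((DWORD)n, t, TRUE, INFINITE);
    printf("%d threads, mask %s: %lu ms\n", n, argv[2],
           (unsigned long)(GetTickCount() - start));
    for (i = 0; i < n; i++)
        CloseHandle(t[i]);
    return 0;
}
```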
 
......

Could someone who has the CPU run some kind of benchmark that can be set to a specific number of threads and test all 3 scenarios?

Would be nice if somebody did that; the problem is I'm not foreseeing a lot of people running out and getting Bulldozers right now....:(

I truly believe that AMD has something here that can be addressed/fixed/fully utilized with better software (updates/optimizations), and then we may see what it truly brings to the table... The major concern for me, though, is the ridiculous power usage....:eek: That's something they will NOT be able to address until the next die shrink, unfortunately...:rolleyes:
 
mzs, if there was something in the pipeline like that, wouldn't AMD at least make that clear? I see no reason why they would not. AMD releasing BD in the state it is in, in my opinion, proves there is no magic patch that will fix it. This sounds like the 2900 XT all over again, when the drivers were supposed to make it bitch-slap the 8800 GTX.
 
mzs, if there was something in the pipeline like that, wouldn't AMD at least make that clear? I see no reason why they would not. AMD releasing BD in the state it is in, in my opinion, proves there is no magic patch that will fix it. This sounds like the 2900 XT all over again, when the drivers were supposed to make it bitch-slap the 8800 GTX.

lol, this is much better than the Fermi comparison!
 
Indeed. At least Fermi had the performance crown, even if it was late and was a power hog. The 2900 XT was overhyped (320 SPs, anyone?), late, power hungry, and at times was only matching the 8800 GTS 640, if I am not mistaken. Of course, if you wanted to bench 3DMark06, the 2900 XT was the best. If people do WinRAR all day or want to get to 8 GHz, it may be a suitable CPU.
 
Bulldozer is basically four dual-core modules, each pair of cores sharing the same cache. Supposedly this is not an optimal design and causes tug-of-war "thrashing" between the cores. The poster disabled one core on each module, making it a 4-core CPU where each core has its own "dedicated" cache. The results are represented by the red bar in the graphs, and as you can see there are improvements to be made... Software (i.e. OS) and/or BIOS fixes could address this issue. I hope.. :)

[attached benchmark graph]

The link is down right now so I can't read it for myself, but maybe this is also partially explained by each core getting the module's FPU to itself?

Do any of these benchmarks rely on AVX?
 
I have a question.

Do we know this is really the case, and that the system isn't aware of which cores are full cores and which ones are half cores?

It should really be basic preparation for AMD, with such a huge architecture change, to make sure Windows is aware of which cores are full cores.

Could someone who has the CPU run some kind of benchmark that can be set to a specific number of threads and test all 3 scenarios?

Basically, the Windows 7 task scheduler just kind of plops things at random onto the next available core. The way Bulldozer is designed, it is a benefit to have a thread and its child running on the same module, or any two threads that have to communicate over the same data. I don't know what W7 does in that event; I guess you could reschedule threads and make sure a parent and its forks are in the same module...

The only way AMD could have done anything was to ask Microsoft nicely to basically rewrite Windows 7's task scheduler... There is speculation that the entire processor thrashes its cache at some point. Bulldozer's multithreaded performance is pitiful, and the performance in general is erratic. It seems to do well when you have a massively parallel application running and hammering instructions left and right. Anything else? Nope!

But I'm curious to see what happens when you have an operating system whose task scheduler has some sense of intelligence to take advantage of Bulldozer's quirky design, and you run the processor in 4-core/4-module mode (turn off a core in every module, since it seems to help single-threaded performance).
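On the question of whether the OS even knows about the shared caches: Windows does expose the topology via GetLogicalProcessorInformation, so a quick probe like the sketch below (error handling trimmed) would show whether Bulldozer's shared L2s are reported as shared. Whether the Windows 7 scheduler actually uses that information is exactly what's in doubt:

```c
/* topo_probe.c -- ask Windows which logical CPUs share each L2 cache. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info;
    DWORD i, count;

    GetLogicalProcessorInformation(NULL, &len);  /* first call: get size */
    info = malloc(len);
    if (!info || !GetLogicalProcessorInformation(info, &len))
        return 1;

    count = len / sizeof *info;
    for (i = 0; i < count; i++) {
        /* Print the mask of logical CPUs behind every L2 cache. */
        if (info[i].Relationship == RelationCache &&
            info[i].Cache.Level == 2)
            printf("L2 shared by logical CPU mask 0x%llx\n",
                   (unsigned long long)info[i].ProcessorMask);
    }
    free(info);
    return 0;
}
```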

And I'd like to see how making the modules fly solo affects temperature and power draw.
What the hell is the nominal voltage for this thing? And 125 W TDP, my ass.
The only saving grace will be fixes in later steppings and unlockable modules, even if one core inside a module is bad. No biggie, because most desktop stuff is single-thread intensive anyway.

I'm tired and I stayed up all night; I'll read more about it when I wake up tomorrow.

Zarathustra[H];1037872756 said:
The link is down right now so I can't read it for myself, but maybe this is also partially explained by each core getting the module's FPU to itself?

Do any of these benchmarks rely on AVX?

The x264 encoding benchmark, I think, was AMD-supplied and supposedly supports AVX, and it only does well on the thread-heavy second pass. So bizarre.
 
Basically, the Windows 7 task scheduler just kind of plops things at random onto the next available core. The way Bulldozer is designed, it is a benefit to have a thread and its child running on the same module, or any two threads that have to communicate over the same data. I don't know what W7 does in that event; I guess you could reschedule threads and make sure a parent and its forks are in the same module...

Wouldn't Intel also benefit from this improved scheduler? I mean, their HT threads both run on the same core. You'd think they'd benefit even more...

Also, if this improved scheduler has shown up in Windows 8, you'd think they could patch it into Win7 with relative ease. Next service pack, maybe?
 
Indeed. At least Fermi had the performance crown even if it was late and was a power hog.

No, it didn't. The 5970 still wiped the floor with it. The Fermi 400 series is almost the same as this flop, except with AMD, instead of wood screws, you get ridiculous marketing slides.
 
No, it didn't. The 5970 still wiped the floor with it. The Fermi 400 series is almost the same as this flop, except with AMD, instead of wood screws, you get marketing slides.

For a single-GPU card, yes, Fermi did have the performance crown. From what I saw, more people had GTX 480s for multi-GPU as well. The 5800 series had some serious CrossFire issues by the looks of it; [H] showed that on quite a few occasions in reviews too.
 
For a single-GPU card, yes, Fermi did have the performance crown. From what I saw, more people had GTX 480s for multi-GPU as well. The 5800 series had some serious CrossFire issues by the looks of it; [H] showed that on quite a few occasions in reviews too.

Yup. Seriously, you have to be falling for AMD's marketing fluff if you believe the HD 5970 is anything more than two underclocked HD 5870s on one card :p

On the other hand, it is a valid argument... the HD 5970, despite its many issues, was still the fastest single card in many cases :eek:

Just like how an MCM CPU could be accepted, IMO. But maybe not :D
 
Zarathustra[H];1037872897 said:
Wouldn't Intel also benefit from this improved scheduler. I mean, their HT threads are both on the same core. You'd think they'd benefit even more...

Also, if this improved scheduler has shown up in Windows 8, you'd think they could patch it to Win7 with relative ease. Next service pack maybe?

Hopefully. I'm just crossing my fingers for backlash and price drops, or easily unlocked 4100s, along with improvements in power efficiency... It is a bit painful to see AMD pooch it so badly. I mean, you'd think the engineers would have thought of this stuff ahead of time, and QA would have done a plethora of tests and gone "Hmm, something isn't right here..."

If you think about it, with each module's two cores sharing that 2MB of L2 cache, it's really 1MB per core. And we don't know the nitty-gritty. There could be situations where one thread is loaded onto core 0 and another onto core 1 (same module), and maybe core 1 keeps evicting core 0's cache lines in favor of its own hits, slowing core 0 down (the assumption being that the threads in the same module have nothing to do with one another: not related threads, not a fork, not a child process).

But I'm not a micro engineer, sadly. :(
 
Fermi was bad, but nowhere near as bad as this. At least Fermi had great tessellation performance. What's Bulldozer's claim to the throne? ~2 billion transistors with nothing to show for it.
 
Fermi was only bad heat-wise, but if you can cool it down, it's great.
Not to mention the power bill. But yes, disregarding power and heat, the original Fermi was a very good performer.
What's Bulldozer's claim to the throne? ~2 billion transistors with nothing to show for it.
Exactly my sentiments.
 
No, it didn't. The 5970 still wiped the floor with it. The Fermi 400 series is almost the same as this flop, except with AMD, instead of wood screws, you get ridiculous marketing slides.

Yeah, but the 5970 was a dual-GPU design, subject to all the problems (compatibility and scaling) that come along with that.

The best comparison to a 5970 would be a 400 series card in SLI. The 470 and 480 both killed the 5970 in SLI. The later, cheaper 460 roughly tied it on average in SLI.

I mean, wasn't the 5970 almost $800 in the beginning of 2010 when the 400 series was launched?

As I recall, the MSRP was $600, but retailer price gouging due to limited availability forced that price up a bit.

You could have gotten two GTX 470s and put them in SLI for that, and gotten better performance in the vast majority of benchmarks.

The fact that the two GPUs are on the same card doesn't really matter that much.

If you had unlimited funds to spend, you could have quad-SLI'd GTX 480s at launch, and they would definitely have been faster than two 5970s or four 5870s (as I recall, 4 GPUs was the max for both Nvidia and AMD SLI/CF back then, and still is now).
 