AMD Anti-Hyperthreading

Theoretically, if optimized correctly, a 1.8GHz dual core can effectively run like a 3.6GHz single core. Wow... imagine that...
 
Anybody here remember Digital's FX!32?

AMD has some ex-Alpha guys on staff now. Seems to me that Pacifica may be the hardware version. Or at the VERY least it provides the environment needed for a hardware implementation. Sounds a lot like this so-called anti-HT.

Just think about the new coprocessor. How is AMD going to get this beast working without overhauling the ISA, or at the very minimum introducing a new MASSIVE extension to the legacy ISA? That's something AMD has only done twice before: once with 3DNow!, which was only done to fix flaws in MMX, and once more with AMD64, which was done to fix flaws in the legacy ISA. I really don't think they are going to be modifying the ISA anytime soon.

I think Pacifica provides the foundation for a hardware-based runtime compiler and interpreter. The first run is usually pretty slow, but subsequent runs are almost native. Sounds interesting, and it would provide the massive amount of logic needed to do something like this. There is some information out that a new extension will be released in the future, but if I'm right, it may be to get this environment working effectively.

What do you guys think?
 
ValeX said:
Finally, a use for 4 cores! We could either make it into 2 faster cores, or one reallllyyy fast core.....haven't clicked the links yet, might end up editing this comment :p

ValeX
That's exactly what I was thinking! If they can pull off 3GHz per core on a quad core processor with the AM2 die shrink, and combine the four cores into two sets of two, you effectively have a 6GHz (virtual) dual core :eek:

One set made with all four cores would theoretically net you a 12GHz single (virtual) core! If I may quote UT2004 "HOLY SHIT!!!"

I'll admit, I'm an AMD f@nboy, but if they can pull this off (without any major overhead) the dual core Conroe is in trouble.
 
I can't see this being a benefit to the average thread. Hopping across the SRQ to access the other CPU's cache every once in a while for a mutex / semaphore isn't a big deal. But doing it every time there's a dependency on an instruction being executed on that core? It just doesn't seem to have a benefit to me.
There may be some legacy single-threaded apps that remain highly parallel that this could work for. But for the average thread, picking 3 or 4 instructions per cycle to execute is hard; picking 6 where half are executed on a separate core - adding significant latency to accessing the results - I don't see it.
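Just to illustrate the kind of thing I mean, here's a made-up toy loop (nobody's real code, obviously): every line needs the result of the line right before it, so shipping half of it to the other core just trades a register read for a trip across the SRQ.

#include <stdio.h>

/* Toy dependency chain: every statement needs the result of the one
 * right before it, so a second core has nothing useful to do without
 * waiting on the first core's value each iteration. */
int main(void)
{
    unsigned int x = 12345;
    for (int i = 0; i < 1000; i++) {
        x = x * 1103515245u + 12345u;  /* depends on x from last step */
        x ^= x >> 7;                   /* depends on the line above   */
        x += (unsigned int)i;          /* still the same chain        */
    }
    printf("%u\n", x);
    return 0;
}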
 
Better to do this in hardware than have the coders spend more time multi-threading their code.
 
serbiaNem said:
Better to do this in hardware than have the coders spend more time multi-threading their code.

This is not, and never will be, a substitute for multithreading.
Threads can go and do their own work completely independent of one another for maybe millions of clock cycles.
Instructions invariably have dependencies 5, 50, 500 clock cycles down the line. That's way too much waiting for results to be shared between the cores. There are a few interesting possibilities (like faking branch predication using two cores), but with AMD's dual core architecture the way it is today, this is a total gimmick, and a non-issue as far as I see it.
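By "faking branch predication" I mean something like this toy sketch (my own example, nothing AMD has described): compute both arms of a branch and keep one. A compiler can already do this on a single core; the two-core fantasy version would park each arm on its own core and then still pay a cross-core trip to merge the result.

#include <stdio.h>

/* Toy predication / eager execution: evaluate both arms of a branch,
 * then select one, instead of branching.  In the two-core version each
 * arm would run on its own core and the select would need a cross-core
 * transfer to bring the results back together. */
static int both_arms(int a, int b, int cond)
{
    int taken     = a + b;            /* the "if" arm   */
    int not_taken = a - b;            /* the "else" arm */
    return cond ? taken : not_taken;  /* keep one, discard the other */
}

int main(void)
{
    printf("%d %d\n", both_arms(7, 3, 1), both_arms(7, 3, 0));
    return 0;
}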
 
FreiDOg outsmarted the entire AMD Research and Development Team... Holy smokes!!!!
 
That's why I say a runtime compiler may be the right method. After all, it would be able to track instruction paths and optimize instructions on the fly to match the current situation, all in hardware. I still recall FX!32; while it was a software interpreter, it pretty much did just that, and it was able to do it at almost full native speed.

I think this would do the trick. Also recall the runtime compiler that NEC made. It runs in hardware, so it has already been done. I see no reason why AMD couldn't also do it.

Also, I honestly think there is no way in hell that any human is capable of multithreading an application to the extent that AMD's new coprocessor would need. It has something like 64 symmetric cores (correct me if I'm wrong), for cryin' out loud. There simply is no way that a human would be able to code for that. It almost has to be automated in hardware.
 
duby229 said:
That's why I say a runtime compiler may be the right method. After all, it would be able to track instruction paths and optimize instructions on the fly to match the current situation, all in hardware. I still recall FX!32; while it was a software interpreter, it pretty much did just that, and it was able to do it at almost full native speed.

I think this would do the trick. Also recall the runtime compiler that NEC made. It runs in hardware, so it has already been done. I see no reason why AMD couldn't also do it.

I don't know that a hardware runtime compiler would be of all that much benefit here either. You can't just 'shake the tree', so to speak, and expect 2 threads to fall out. Independence between code fragments is really something that has to be done at design time. I don't know what NEC did with its hardware compiler, but with Alpha's FX!32, while it wasn't necessarily a one-to-one translation, each instruction under FX!32 has a corresponding instruction or instruction chain under Alpha's 64-bit ISA. There is no inherent multithreaded or parallel version of 'find the node in the list', or 'factor a large number for RSA encryption'. Those types of things can either be done in parallel or not. If they can be run in parallel, it's easy to do so in their native x86 format; if not, no amount of re-compiling in hardware is going to change that.
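'Find the node in the list' is the classic example - toy code of my own below, obviously: you can't even look at node N+1 until you've loaded node N's next pointer, so no recompiler, hardware or otherwise, is going to split that walk across two cores.

#include <stddef.h>

/* Classic pointer chasing: each step depends on loading the previous
 * node's next pointer, so the walk is inherently sequential no matter
 * how it gets recompiled. */
struct node {
    int          value;
    struct node *next;
};

static struct node *find(struct node *head, int wanted)
{
    for (struct node *n = head; n != NULL; n = n->next) {
        if (n->value == wanted)
            return n;
    }
    return NULL;
}

int main(void)
{
    struct node c = { 3, NULL }, b = { 2, &c }, a = { 1, &b };
    return find(&a, 3) ? 0 : 1;   /* exit 0 if the node was found */
}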

Now if AMD wanted to run their own bytecode language on this hardware and embed lots of dependency information in that bytecode, there's the potential to smartly schedule different dependency chains on different cores. Doing so without additional information, however, is hard at best, given how messy x86 can be in shuffling data around its limited register set.

A hardware compiler is probably not needed, but hardware to search for dependency chains and try to identify code fragments that can be executed way out of order (maybe looking at the next 200 instructions instead of the next 10 or 15 the current decoders examine) is something to look into. Again, I just don't think there are enough instructions in your average thread that can be identified, moved to the other core, executed, and moved back without disrupting the first core to make this more than a marketing gimmick.


Also, I honestly think there is no way in hell that any human is capable of multithreading an application to the extent that AMD's new coprocessor would need. It has something like 64 symmetric cores (correct me if I'm wrong), for cryin' out loud. There simply is no way that a human would be able to code for that. It almost has to be automated in hardware.

I think you're thinking of something else. AMD's coprocessor (in addition to being nothing but rumor yet - they've talked with, and maybe signed agreements with, a few companies that make specialized processors) isn't a specific device. It's the ability to plug a specialized device (vector processor, encryption, advanced math functions for special purposes) into an S940 socket and have the Opteron interact with it over a CPU <-> CPU HTT link. Most of those you'll probably want supported by the software compiler to take proper advantage of the coprocessor.
 
FreiDOg said:
I don't know that a hardware runtime compiler would be of all that much benefit here either. You can't just 'shake the tree', so to speak, and expect 2 threads to fall out. Independence between code fragments is really something that has to be done at design time. I don't know what NEC did with its hardware compiler, but with Alpha's FX!32, while it wasn't necessarily a one-to-one translation, each instruction under FX!32 has a corresponding instruction or instruction chain under Alpha's 64-bit ISA. There is no inherent multithreaded or parallel version of 'find the node in the list', or 'factor a large number for RSA encryption'. Those types of things can either be done in parallel or not. If they can be run in parallel, it's easy to do so in their native x86 format; if not, no amount of re-compiling in hardware is going to change that.

Here is my theory. While I could be, and most likely will be, wrong on a lot of this, I think it could work. Say that AMD takes a concept like FX!32, but rather than using it to convert binaries from one ISA to another, they use it to convert x86 and all of its extensions into native macro code. Both cores can talk to each other and collaborate on the runtime compiler, and determine which core would execute what based on how close the instruction chain is to that particular core; with a large unified L3 this could be theoretically possible. Also, I'm expecting that at the 65nm node AMD will prolly integrate the coprocessor on die.

What it wouldn't be able to do is work across sockets, so in a multi-socket system each socket, rather than each core, would have its own thread.

Now if AMD wanted to run their own bytecode language on this hardware and embed lots of dependancy information in their bytecode there's the potential to smartly schedule diferent dependancy chains on different cores. Doing so without additional information however is hard at best given how messy x86 can be in shuffeling data around its limited register set.

I don't think it's a matter of changing the bytecode at all. AMD has been able to adopt every extension that has ever been made for x86 using the same bytecode for years now.

However, that does bring up an interesting question: what will the coprocessor actually be used for?

A hardware compiler, probably not needed, but hardware to search through for dependancy chains and try to identify code fragements that can be executed way out of order (maybe look at the next 200 instructions instead of the next 10 or 15 the current decoders examine), is something to look into. Again I just don't think there are enough instructions in your average thread that can be identified, moved to the other core, executed and moved back without disrupting the first core to make this more than a marketing gimic.

I agree for the most part. But what if the two processors were somehow able to retire instructions independently of each other, even within the same thread? It would require some work on how the OoO system works, but it would prevent the need to move data back and forth, and a good compiler that could predict code paths far enough in advance could make sure that the proper instructions were already at the proper core before execution.

I think you're thinking of something else. AMD's coprocessor (in addition to being nothing but rumor yet - they've talked with, and maybe signed agreements with, a few companies that make specialized processors) isn't a specific device. It's the ability to plug a specialized device (vector processor, encryption, advanced math functions for special purposes) into an S940 socket and have the Opteron interact with it over a CPU <-> CPU HTT link. Most of those you'll probably want supported by the software compiler to take proper advantage of the coprocessor.

For the time being that is true, but I am fully expecting it to get integrated on 65nm. Prolly not on the initial core revision, but subsequently it almost certainly will be.



Anyhow, I can see there are some major holes in my logic, and it may not work the way that I think it might. Thanks man. I appreciate people that are smarter than me so I can learn from them ;)
 
duby229 said:
Here is my theory. While I could be, and most likely will be, wrong on a lot of this, I think it could work. Say that AMD takes a concept like FX!32, but rather than using it to convert binaries from one ISA to another, they use it to convert x86 and all of its extensions into native macro code. Both cores can talk to each other and collaborate on the runtime compiler, and determine which core would execute what based on how close the instruction chain is to that particular core; with a large unified L3 this could be theoretically possible. Also, I'm expecting that at the 65nm node AMD will prolly integrate the coprocessor on die.

What it wouldn't be able to do is work across sockets, so in a multi-socket system each socket, rather than each core, would have its own thread.

You want the compiler to grab a large chunk of instructions, then dole them out to the two cores on a per-instruction or per-dependency-chain basis?
We're almost getting to where there's a massive decoder on the front end here, except instead of x86 -> µops like the CPU decoders do, it's x86 stream -> x86 fragments for each core to work on.
You talk about the cores communicating (presumably to share results with dependent operations) and a unified L2 (presumably where data and code would be pooled before being sent to each core, and where results would be written back to). I'm not sure if accessing the other core over the SRQ, or reading/writing the unified L2/L3, is quite an order of magnitude slower than working with L1, but it's close. It's just not something you can do on a per-instruction, or per-block-of-5-or-10-instructions, basis. The overhead for that kind of synchronization is far and away greater than any performance benefit from running some instructions on the second core.
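Rough back-of-the-envelope version of what I mean, with latencies I'm pulling out of the air (illustrative guesses, not measured numbers for any real CPU): even a generous estimate of the cross-core cost swamps whatever a small offloaded fragment saves.

#include <stdio.h>

/* Back-of-the-envelope break-even: the latencies below are guesses for
 * illustration only, not measured figures for any actual processor. */
int main(void)
{
    const double xfer_cycles  = 40.0; /* one core <-> core hop (guess)    */
    const double hops_needed  = 2.0;  /* ship operands out, results back  */
    const double saved_per_op = 0.5;  /* cycles saved per offloaded op    */

    /* The fragment has to save more cycles than the transfers cost. */
    double break_even_ops = hops_needed * xfer_cycles / saved_per_op;
    printf("need a fragment of roughly %.0f independent instructions "
           "just to break even\n", break_even_ops);
    return 0;
}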


I don't think it's a matter of changing the bytecode at all. AMD has been able to adopt every extension that has ever been made for x86 using the same bytecode for years now.

I meant bytecode in the sense of MSIL or Java - though obviously not abstracted from the hardware instructions. You would need to embed large amounts of information about the instructions along with the instructions: data dependencies, instruction dependencies, scheduling hints, etc. An interpreter on the front end would grab large sections of the bytecode - several hundred instructions at a time - and send large code fragments to each core to execute with total independence of one another. The instruction streams sent to the cores would need to be properly ordered so as to provide dependent results as early as possible, and to minimize cache and memory read latency.
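Purely hypothetical, but the kind of record I'm picturing would look something like this (not a real AMD or x86 format, just to show what 'embed the information' would mean):

#include <stdio.h>
#include <stdint.h>

/* Hypothetical annotated "bytecode" record - invented for illustration.
 * The idea is that every op carries its dependency and scheduling hints
 * in the stream, so a front end could split work across cores safely. */
struct annotated_op {
    uint32_t opcode;         /* the instruction itself                    */
    uint32_t src_op[2];      /* indices of the ops whose results it needs */
    uint16_t earliest_slot;  /* scheduling hint: don't issue before this  */
    uint8_t  dep_count;      /* outstanding dependencies                  */
    uint8_t  preferred_core; /* hint: which core already holds the data   */
};

int main(void)
{
    printf("each annotated op would carry %zu bytes of metadata\n",
           sizeof(struct annotated_op) - sizeof(uint32_t));
    return 0;
}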
I just don't believe it's feasible to find 30-, 40-, 50-instruction fragments in a block of two or three hundred to send to the second core. Sequential code just has too many interdependencies to make fracturing it like that work. And without those large blocks there's nothing to hide, or offset, the high cost of core <-> core communication.

However, that does bring up an interesting question: what will the coprocessor actually be used for?

That depends entirely on what type of coprocessor it is. I expect the HTT link to offer quite a bit of flexibility in what can be plugged in.

I agree for the most part. But what if the two processors were somehow able to retire instructions independently of each other, even within the same thread? It would require some work on how the OoO system works, but it would prevent the need to move data back and forth, and a good compiler that could predict code paths far enough in advance could make sure that the proper instructions were already at the proper core before execution.

Each core would have to be able to retire the instructions it executed on its own. What else could happen to them but be retired by the core they were executed on? There's no mechanism to lift them out of the pipeline midway through and move them to the other core.
You're still going to have to move the results back and forth, because dependent instructions are going to exist most of the time. Dependencies happen all the time, and if you try to make sure that every instruction that's dependent on a previous result gets executed on the core that computed the result, all your instructions are going to end up on one core real quick.

Cores have to exchange information - even in multithreaded applications they have to do that. It's just that with a single thread over multiple cores, the likelihood is they do it much too often, resulting in a lot of wasted time.

Anyhow, I can see there are some major holes in my logic, and it may not work the way that I think it might. Thanks man. I appreciate people that are smarter than me so I can learn from them

Holes, errors, or crazy talk - I much prefer a discussion like this to the repetitive 'should I buy an Opteron 165' threads. :)
 
Opajew said:
this whole thread confuses me. Can someone just simplify it for me please?
k here it goes:

AMD is supposedly working on a technology that is essentially the polar opposite of typical multithreading: instead of having two cores run two different threads in parallel, the two cores would run one thread as though they were a single core. Meaning that, in theory, processor speeds actually could be additive (with current SMP, two 2.0GHz CPUs do not make a 4.0GHz CPU, right? Well, with this they kind of could...)

However, this kind of processing is completely and utterly outside the scope of the standard x86 ISA, which would mean that AMD would pretty much have to reinvent the wheel (completely change how instructions are handled internally) while still remaining backwards compatible with the original wheel (it still must APPEAR to run standard x86 ops from the outside). The logistics of doing such a thing are immense... I honestly don't think it could be done without creating a whole new architecture and instruction set.

I guess we'll see with time.
 
Eva_Unit_0 said:
I honestly don't think it could be done without creating a whole new architecture and instruction set.

So, why not do exactly that? Create a new instruction set, add a chip to the new processor that translates to (or from) the x86 instruction set, and keep the existing architecture? Eventually, the new instruction set would be supported natively, and the translator chip could be bypassed or removed completely. If a new architecture IS required, well... the existing one seems to have maxed itself out anyway. We're no longer increasing clock speeds (3GHz seems to be the max), but rather L2 cache and the number of cores. Shrinking the die may help temporarily, but eventually something will have to be done.

I really don't see this as a "Conroe-Killer". There's too much to do in too short a time to get it released within a year of Conroe's release.

It seems to me that this "reverse hyperthreading" idea is a far superior notion to forcing every developer on the planet to relearn how to code for multiprocessing systems. (Granted, a lot of them need to learn their chosen language again anyway. Programmers are becoming REAL sloppy these days.) I've great hope that it pans out, even if it's not as big a boost as everyone's claiming. Even a 50% increase would be amazing.

I'm curious how this newfangled hoo-ha would work with linear code, though.
A=1
B=2
C=3
A+B=X
X+C=Y
Y+X=Z

Assuming we needed Z, there's no way (that I can see) to share this load. Then again, I've never taken any sort of CPU design class, and honestly, I hate programming in general. It could be obvious to a smarter [wo]man than I.
 
Opajew said:
this whole thread confuses me. Can someone just simplify it for me please?
Dude, we are but noobs when it comes to this kind of stuff - but it's still fascinating nonetheless.
 
TeeJayHoward said:
So, why not do exactly that? Create a new instruction set, add a chip to the new processor that translates to (or from) the x86 instruction set, and keep the existing architecture? Eventually, the new instruction set would be supported natively, and the translator chip could be bypassed or removed completely. If a new architecture IS required, well... the existing one seems to have maxed itself out anyway. We're no longer increasing clock speeds (3GHz seems to be the max), but rather L2 cache and the number of cores. Shrinking the die may help temporarily, but eventually something will have to be done.

That's what Intel tried to do with the Itanium and IA-64. They introduced a new instruction set that was, on paper, far superior to x86 but was incompatible with x86. They then used a translator chip to achieve x86 compatibility. The problem is you're assuming that translating from one instruction set to the other can be done efficiently and at a speed that makes it transparent to the CPU (i.e. fast enough to feed instructions at the same rate it could be fed native code). It doesn't work like that. The moment you start talking about translating one instruction set to another, you open up a whole big can of worms.
 
TeeJayHoward said:
I'm curious how this newfangled hoo-ha would work with linear code, though.
A=1
B=2
C=3
A+B=X
X+C=Y
Y+X=Z

Assuming we needed Z, there's no way (that I can see) to share this load. Then again, I've never taken any sort of CPU design class, and honestly, I hate programming in general. It could be obvious to a smarter [wo]man than I.
if you have a programmer who wrote that, the world is in trouble:

a+b = x
x+c = y = a+b+c
y+x = z = a+b+a+b+c

all three can be solved independently.. at the same time ;)
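Spelled out in plain C (toy example, obviously): after the substitution, none of the three results needs another one, so in principle each line could be handed to a different core.

#include <stdio.h>

int main(void)
{
    int a = 1, b = 2, c = 3;

    /* After substitution the three results are independent -
     * no line below needs another line's result. */
    int x = a + b;              /* = 3 */
    int y = a + b + c;          /* = 6 */
    int z = a + b + a + b + c;  /* = 9 */

    printf("x=%d y=%d z=%d\n", x, y, z);
    return 0;
}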
 
Eva_Unit_0 said:
That's what Intel tried to do with the Itanium and IA-64. They introduced a new instruction set that was, on paper, far superior to x86 but was incompatible with x86. They then used a translator chip to achieve x86 compatibility. The problem is you're assuming that translating from one instruction set to the other can be done efficiently and at a speed that makes it transparent to the CPU (i.e. fast enough to feed instructions at the same rate it could be fed native code). It doesn't work like that. The moment you start talking about translating one instruction set to another, you open up a whole big can of worms.

Transmeta did it with their Efficeon line. They translated x86 into their own VLIW instruction format. Of course, x86 was their target market, so Transmeta tailored their ISA to work with x86, with some success - they didn't do badly against LV and ULV Pentium Ms in certain uses (ultralight / tablet PC stuff).

(cf)Eclipse said:
if you have a programmer who wrote that, the world is in trouble:

a+b = x
x+c = y = a+b+c
y+x = z = a+b+a+b+c

all three can be solved independently.. at the same time ;)

Right, you wouldn't want to do them on three separate cores, but you could. Though if X and Y were never used again, the compiler might optimize them out entirely.
 
FreiDOg said:
Transmeta did it with their Efficeon line. They translated x86 into their own VLIW instruction format. Of course, x86 was their target market, so Transmeta tailored their ISA to work with x86, with some success - they didn't do badly against LV and ULV Pentium Ms in certain uses (ultralight / tablet PC stuff).

Right, but their goal was not to be the fastest... they just needed to make it fast enough to be acceptable. AMD's goal would be to make it fast... as close to native speed as possible. And although Transmeta is a success story, even their system could not provide the performance efficiency AMD would need to be looking at. I don't think so, at least.
 
I wonder if this is a method of dodging CPU count limits in versions of Windows, so that they can have, say, 8 cores that only show up as 2.
 
Langford said:
I wonder if this is a method of dodging CPU count limits in versions of Windows, so that they can have, say, 8 cores that only show up as 2.

Windows doesn't license per core, it licenses per socket, so there's no need to do anything like that. Windows XP Pro supports two sockets, so as many cores as you can fit into two sockets are supported. 2 x quad-core gets you 8 cores in XP Pro, as per your example.
 
Langford said:
I wonder if this is a method of dodging CPU count limits in versions of Windows, so that they can have, say, 8 cores that only show up as 2.
I'm sure Windows Vista will reflect the real world, not the world of 5 years ago, regarding the number of cores.
 
Eva_Unit_0 said:
Windows doesn't license per core, it licenses per socket, so there's no need to do anything like that. Windows XP Pro supports two sockets, so as many cores as you can fit into two sockets are supported. 2 x quad-core gets you 8 cores in XP Pro, as per your example.

That's only because Microsoft currently chooses to do so. As the popularity of many-core processors increases, it would not be surprising to see them try to charge more for Windows versions that support more cores while dumbing down cheaper versions by removing such support.
 
Langford said:
That's only because Microsoft currently chooses to do so. As the popularity of many-core processors increases, it would not be surprising to see them try to charge more for Windows versions that support more cores while dumbing down cheaper versions by removing such support.

The thing to keep in mind, though, is that this is actually a NEW policy of Microsoft's, not some old relic of past versions. The whole issue never even came up until the XP Home vs. Hyper-Threading battle a few years ago. I would be surprised if they changed their mind, since it's a new policy to begin with and was made well within sight of multi-core processors being a possibility (and was spurred by pseudo-multi-core systems to begin with: Hyper-Threading).

In fact, they used to license per CPU, and they changed it to per socket.
 
I hate to bring this up again, but that is pretty much what FX!32 did, and it did it at almost native speed.

Put the purpose of FX!32 in the context of the purpose of Pacifica...
 
(cf)Eclipse said:
if you have a programmer who wrote that, the world is in trouble:

a+b = x
x+c = y = a+b+c
y+x = z = a+b+a+b+c

all three can be solved independently.. at the same time ;)

Haha, that was great.

Now if you could only do that for factoring and break RSA you would be rich.
 