Pipeline


A pipeline is simply a whole task that has been broken out into smaller sub-tasks. The concept actually has its roots in mass production
manufacturing plants, such as Ford Motor Company. Henry Ford determined
long ago that even though it took several hours to physically build a car, he
could actually produce a car a minute if he broke out all of the steps required
to put a car together into different physical stations on an assembly line. As
such, one station was responsible for putting in the engine, another the tires,
another the seats, and so on.
Using this logic, when the car assembly line was initially turned on it still took
several hours to get the first car to come off the end and be finished, but since
everything was being done in steps or stages, the second car was right
behind it and was almost completed when the first one rolled off. This followed
with the third, fourth, and so on. Thus the assembly line was formed, and
mass production became a reality.
In computers, the same basic logic applies, but rather than producing
something physical on an assembly line, it is the workload itself (required to
carry out the task at hand) that gets broken down into smaller stages, called
the pipeline.
Consider a simple operation. Suppose the need exists to take two numbers
and multiply them together and then store the result. As humans, we would
just look at the numbers and multiply them (or, if they're too big, punch them
into a calculator) and then write down the result. We wouldn't give much thought to the process; we would just do it.
Computers aren't that smart; they have to be told exactly how to do
everything. So, a programmer would have to tell the computer where the first
number was, where the second number was, what operation to perform (a
multiply), and then where to store the result.
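In code, those four pieces of information look something like the snippet below. This is a minimal C sketch of my own for illustration; the variable names and values are made up and do not come from the article.

#include <stdio.h>

int main(void)
{
    int first  = 6;   /* where the first number lives      */
    int second = 7;   /* where the second number lives     */
    int result;       /* where the result should be stored */

    result = first * second;   /* the operation to perform: a multiply */

    printf("%d\n", result);
    return 0;
}

Even that single multiply statement decomposes, at the hardware level, into the simplified stages listed next.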
This logic can be broken down into the following (greatly simplified) steps, or stages, of the pipeline:
1. Fetch the first number.
2. Fetch the second number.
3. Multiply the two numbers together.
4. Store the result.
This pipeline has four stages. Now suppose that each of these logical
operations took one clock cycle to complete (which is fairly typical in modern
computers). That would mean the completed task of multiplying two numbers
together would take four clock cycles to complete. However, because the stages can work at the same time (in parallel) rather than one after another, each stage can be busy with a different task on every clock. Any single task still physically takes four clock cycles from start to finish, but once the pipeline is full, one task's output is “retired” (completed) after every clock cycle. So, even though each task takes four clock cycles to complete, tasks appear to retire at a rate of one per clock cycle.
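A quick way to sanity-check that claim is to count cycles for an ideal pipeline. The small C sketch below is my own illustration; it assumes every stage takes exactly one cycle and nothing ever stalls, which is the simplified case described above.

#include <stdio.h>

/* Ideal pipeline: the first task needs STAGES cycles to fill the pipe;
 * after that, one task retires every cycle.
 *   pipelined cycles   = STAGES + (tasks - 1)
 *   unpipelined cycles = STAGES * tasks            */
#define STAGES 4

static long cycles_pipelined(long tasks)   { return STAGES + (tasks - 1); }
static long cycles_unpipelined(long tasks) { return STAGES * tasks; }

int main(void)
{
    long n = 1000;
    printf("unpipelined: %ld cycles\n", cycles_unpipelined(n)); /* 4000 */
    printf("pipelined:   %ld cycles\n", cycles_pipelined(n));   /* 1003 */
    return 0;
}

For a large number of tasks the pipelined count approaches one cycle per task, which is exactly the “retired one per clock” behavior described above.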
This concept can be visualized with colors added to the previous image and
the stages broken out for each clock. Imagine each color representing a stage
involved in processing a computer instruction, and that each takes four clock
cycles to complete. The red, green, and dark blue instructions would've had
other stages above our block, and the yellow, purple, and brown instructions
would need additional clock cycles after our block to complete. But, as you
can see, even with all of this going on simultaneously, after every single clock
cycle an instruction (which actually took four clocks to execute) is completed!
This is the big advantage of processing data via a pipeline.

This may seem a little confusing, so try to look at it this way. There are four
units, and in every clock cycle each unit is doing something. You can visualize
each unit doing its own bit of work with the following breakout:
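If the image doesn't come through, this small C sketch (my own, purely illustrative) prints the same breakout as text: for each clock cycle, it shows which instruction each of the four units, or stages, is working on.

#include <stdio.h>

#define STAGES        4
#define INSTRUCTIONS  7

int main(void)
{
    /* At clock c (counting from 0), stage s holds instruction (c - s),
     * provided that instruction number actually exists. */
    printf("clock | stage1 stage2 stage3 stage4\n");
    for (int c = 0; c < INSTRUCTIONS + STAGES - 1; c++) {
        printf("%5d |", c + 1);
        for (int s = 0; s < STAGES; s++) {
            int instr = c - s;
            if (instr >= 0 && instr < INSTRUCTIONS)
                printf("    i%d ", instr + 1);
            else
                printf("    -  ");
        }
        printf("\n");
    }
    return 0;
}

Once the pipe is full (clock 4 here), stage 4 finishes a different instruction on every clock, even though each individual instruction spends four clocks in flight.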

Every clock cycle, each unit has something to do. And because each sub-task is designed to take exactly one clock cycle, by the time data is ready to move from one stage to the next, the next stage is guaranteed to be ready for it; by definition, each unit has to complete its work in one clock cycle. If it doesn't, then the processor isn't working like it's supposed to (this is one reason why you can only overclock CPUs so far and no further, even with great cooling). And because all of those units are working together, four-step instructions (or tasks) can be completed at a rate of one per clock.
The speed-up potential of this should be obvious, especially when you consider how many stages modern processors have (from 8 in the Itanium 2 all the way up to 31 in Prescott!). The terms “Super Pipelined” and “Hyper Pipelined” have become commonplace to describe the extent to which this breakout has been employed.
Below is the pipeline for the Itanium 2. Each stage represents something that IA64 can do, and once everything gets rolling the Itanium 2 is able to process data really, really quickly. The problem with IA64 is that the compiler or assembly language programmer has to be extremely thorough in figuring out the best way to keep all of those pipeline stages filled all of the time, because when they're not filled the Itanium's performance drops significantly:

I was hoping to find an image showing Prescott's 31 stages, but I couldn't.
The closest I found was a black-and-white comparison of the P6 core
(Pentium III) and the original P7 core (Willamette). If anyone has a link
showing Prescott's 31 stages, please let us know.

Here is an Opteron pipeline shown through actual logic units as they exist on
the chip. This will help you visualize how the logical stages shown above for
Itanium 2 might relate to physical units on the CPU die itself:

As you can see, there are different parts of the pipeline all working together, just like on an assembly line. They all relate to one another to get real work done. Some of it is front-end preparation, some of it is actual execution; and once everything is completed, parts are dedicated to “retiring data”, or putting it back wherever it needs to go (main memory/cache, something called an internal register, which is like a super-fast cache inside the processor itself, an external data port, etc.).
It's worth noting that the hyper-pipelined design of Intel's NetBurst architecture (used from Willamette through Prescott) turned out to be a dead end when pushed to its extreme 31-stage pipeline in Prescott. The reason is the penalty that comes from mis-predicting where the computer program will go next. If the processor guesses wrong, it has to refill the pipeline, and that takes many clock cycles before any real work can start flowing again (just like how it takes several hours to make the first car). Another penalty is the extreme heat generated at the high clock rates seen in Prescott-based P4s.
As a result, real-world experience has shown that there is a trade-off between how deep your pipeline can be and how deep it should be, given the type of processing you're doing. Even though on paper it might seem a better idea to have a 50-stage pipeline with a 50GHz clock rate, a designer cannot simply go and build it, even though it would allow extremely complex tasks to be completed 50 billion times per second (though with GaAs chips on the way, that might now be possible).
Chip designers can't do it because there are real-world constraints that mandate a happy medium between that ideal and what can actually be built. The biggest factor is how a computer program jumps around constantly, calling sub-routines or functions, going over if..else..endif branches, looping, etc. The processor constantly runs the risk of guessing a branch wrong, and when it does it must invalidate everything it “guessed” on in the pipeline and begin to refill it completely, and that takes time and lowers your performance.
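One way to see why deep pipelines pay so dearly for wrong guesses is to treat each misprediction as a fixed number of wasted refill cycles, roughly the pipeline depth. The C sketch below is a back-of-the-envelope model, not a description of any particular CPU, and the branch and misprediction rates in it are invented for illustration.

#include <stdio.h>

/* Effective cycles per instruction with branch mispredictions:
 *   CPI = 1 + branch_fraction * mispredict_rate * refill_penalty
 * where refill_penalty is roughly the pipeline depth. */
static double effective_cpi(double branch_fraction,
                            double mispredict_rate,
                            int    pipeline_depth)
{
    return 1.0 + branch_fraction * mispredict_rate * pipeline_depth;
}

int main(void)
{
    double branches   = 0.20;  /* assume 1 in 5 instructions is a branch       */
    double mispredict = 0.05;  /* assume the predictor is wrong 5% of the time */

    printf("12-stage pipeline: %.2f cycles per instruction\n",
           effective_cpi(branches, mispredict, 12));   /* ~1.12 */
    printf("31-stage pipeline: %.2f cycles per instruction\n",
           effective_cpi(branches, mispredict, 31));   /* ~1.31 */
    return 0;
}

Even with a predictor that is right 95% of the time, the deeper pipe gives back noticeably more of its theoretical one-per-clock throughput, which is exactly the problem Prescott ran into.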
The practical limits on pipeline depth are simply a side-effect of running code through the facilities a processor has available to carry out the workload. A processor just can't do stuff the way a person can. Everything inside a CPU has to be programmed exactly as it needs to be, with absolutely no margin for error or guesswork. Any error, no matter how small, means the processor becomes totally and completely useless; it might as well not even exist.
I hope this article has been informative. It should've given you a way to
visualize a processor pipeline, understand why it is important to performance,
and help you put together how it all works. You should be able to see why
designs like Prescott (which take the pipeline depth to an extreme) often come
at a real-world performance cost. You should also appreciate why slower-clocked processors (such as an Itanium 2 at 1.8GHz) are able to do more work than much higher-clocked processors (like a Pentium 4 at 4GHz). It's exactly because of the number of pipeline stages, coupled with the number of available units inside the chip that can do things in parallel.
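That comparison boils down to useful work per second being instructions per clock times clock rate. Here is a tiny C sketch making that arithmetic explicit; the IPC figures are hypothetical numbers chosen only to illustrate the point, not measured values for either chip.

#include <stdio.h>

/* Useful work per second = (instructions per clock) * (clocks per second). */
static double work_rate(double ipc, double ghz)
{
    return ipc * ghz;   /* billions of instructions per second */
}

int main(void)
{
    printf("wide, slower-clocked chip:   %.1f G instr/s\n", work_rate(3.0, 1.8)); /* 5.4 */
    printf("narrow, higher-clocked chip: %.1f G instr/s\n", work_rate(1.0, 4.0)); /* 4.0 */
    return 0;
}

A chip that completes more instructions per clock can out-work one that simply clocks higher.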
The pipeline allows things to be done in parallel, and that means that a CPU's
logic units are kept as busy as possible as often as possible to make sure that
the instructions keep flying off the end at the highest rate possible.
Keep in mind that there are several other factors that speed up processing: processor concepts such as OoO (Out-of-Order) execution, speculative
execution, the benefits of cache, etc. Stay tuned to ChipGeek for coverage of
those, and keep your inner-geek close by. :)
Post your questions and comments below.
Also, for your reading pleasure, here are some other online articles relating to
pipelines and pipeline stages: Ars Technica on pipelining in
general and Opteron's pipeline; some info on Prescott's die; and a history of
Intel chips and their pipeline depths. This closing graphic will summarize the
trend from the original 8086 through today's Pentium 4. Enjoy!

USER COMMENTS 38 comment(s)
I now know what the problem is! (3:18pm EST Fri Feb 10 2006)
Damn, have Ford, GM, and Chrysler figured it out yet? Their pipeline is simply too deep to be efficient today.
They need to shorten the pipeline on their manufacturing to improve their performance and compete better with AMD - by Nice summary Rick
8086 (3:22pm EST Fri Feb 10 2006)
Why is it (8086) considered to have 2 pipelines? Do the integer and FP units each count as a pipeline? I
didn't think it could do simultaneous operations… - by Ray
Most people (3:28pm EST Fri Feb 10 2006)
look at processor pipelines and think of video processors. Not many people think of the main CPU when referring to pipelines; instead they think MHz. I have 20 or 24 pipelines in my Nvidia chip.
7800GT - by Regulas
pipelines, mis-prediction and thermal dissipation are separate entity (3:46pm EST Fri Feb 10 2006)
why blame mis-prediction on pipeline?!

also, why blame thermal dissipation on pipeline?!
Both mis-prediction and thermal dissipation will be resolved in future, and pipelines will continue to
increase. - by let's be fair
(8086) what (4:21pm EST Fri Feb 10 2006)
2 is referring to a stage count, which varies depending on the operation - by asdf
Huhuh Huhuh (4:47pm EST Fri Feb 10 2006)
He said pipe.
- by –Beavis
re: “piplelines, mis-prediction…” (5:25pm EST Fri Feb 10 2006)
You obviously don't have a very good grasp on computer architecture…branch misprediction is actually
NEVER worse than not attempting a prediction. It is, however, only a problem in pipelined design. If you
don't ATTEMPT prediction, there is no possibility of misprediction, but then we don't have a pipeline
either.
Take another computer architecture course. Prediction is a huge part of design, and will never truly be
resolved. The algorithms just get better and better - by Will
Good article (5:27pm EST Fri Feb 10 2006)
On a tangent, this is one of the better articles that's ever been posted on Geek.com. Keep up the good
work! - by Will
Important issues (5:38pm EST Fri Feb 10 2006)
In a multi-core market the number of pipelines will be important, as well as OoO processing. Processor
features were sorely needed in the early '00s, and there are some nagging issues that are now being
fixed. We've waited a long time for multi-cores to come along. Let's hope some other issues shake out as well. - by tech
re: “piplelines, mis-prediction…” (5:39pm EST Fri Feb 10 2006)
“why blame thermal dissipation on pipeline?!”

Evidently you don't know shit about pipeline implications. The larger the pipeline, the smaller the pipe stages need to be; this means each pipe stage can resolve its task in less time, which is why you can have a higher frequency with larger pipelines. Higher frequency means transistors will toggle more often and thus dynamic power will increase.
This is the reason why we see the same performance from AMD and Intel even though Intel's frequency runs higher.
Just so you know, there are many other implications, like the increase in wasted time for flop setup/hold times because you have more flops … this is why you need latch-based design and time borrowing …
Anyway … go take an advanced computer architecture course and educate yourself. - by jko
8086 (5:40pm EST Fri Feb 10 2006)
Ok, never mind… I see the 8086 had a prefetch unit and an execute unit comprising the 2 stage
pipeline… - by Ray
re: re: “piplelines, mis-prediction…” (6:06pm EST Fri Feb 10 2006)
“branch misprediction is actually NEVER worse than not attempting a prediction.”

Will:
Your observation is correct, but you lack creativity: with the new multithreading/multi-core machines, prediction is necessary. However, as we see an increased number of cores/threads, you can build a machine that favors non-predicted addresses over predictions. Meaning you go to the bus for data you know is good … prediction would be a backup in case your cores run out of bus requests, so I do foresee the dawn of the BPU.
) - by jko
cool article, keep them coming (6:27pm EST Fri Feb 10 2006)
I really liked this article, the quality of your articles gets better all the time, keep it up.

However, I would have liked it even more had it gone into some of the
drawbacks of pipelines (which I know are few). It touches on what happens
when your branch prediction algorithm falls apart (a purely statistical event
which is guaranteed to happen a certain percentage of the time). But I know
there are pipeline events like stalls and other reasons to flush them as well
which are equally destructive to your performance. I'm just not very well
versed in them.
Also, since I seem to be one of the few clock speed evangelists here, I'm compelled to point out that no matter what you do in your pipe, the CPU still executes only one instruction at a time.
In that respect it's very different from a human factory-worker assembly line, as no matter what, everything goes through the CPU eventually. Kind of like if Henry Ford (also the inventor of the assembly line) had to autograph every car before it could ship: no matter what you did in the factory to help him out, you're still limited by how fast he can do his thing.

In a perfect world, there would be no more than one clock delay between your CPU and its instruction or data fetches, all adds/multiplies would return the answer in a single clock, and a pipeline would buy you nothing; in fact it would just be a FIFO buffer at that point because there would be nothing it could do to help keep the CPU queued up.
Another perfectly valid way to eliminate wait clocks (aka no-ops) at the CPU is to speed up what's slowing your CPU down in the first place; one-clock add/multiply logic is a defining feature of most DSPs, as is a recirculating multiplier coefficient memory buffer. If your multiply gets done in one clock, then a multiply stage in your pipe and its logic buys you nothing but latency.
Fetching from a tightly coupled memory (TCM), like a cache, is also an excellent way to reduce the number of clock delays it takes to fetch or branch inside your TCM. On the StrongARM system I just finished, a cache miss going across the bus could often take hundreds, even thousands, of clocks; no pipeline is deep enough to soften that blow. - by EE
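To put rough numbers on EE's point, the usual back-of-the-envelope formula is average access time = hit time + miss rate x miss penalty. The C sketch below is an illustration only; the hit time, miss penalty, and miss rates are made-up figures, not numbers from the StrongARM system mentioned above.

#include <stdio.h>

/* Average memory access time in cycles:
 *   AMAT = hit_time + miss_rate * miss_penalty */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    /* Illustrative numbers: 1-cycle cache hit, 300-cycle trip across the bus. */
    printf("2%% miss rate:  %.1f cycles per access\n", amat(1.0, 0.02, 300.0)); /*  7.0 */
    printf("10%% miss rate: %.1f cycles per access\n", amat(1.0, 0.10, 300.0)); /* 31.0 */
    return 0;
}

Even a modest miss rate swamps the one-cycle-per-stage rhythm the pipeline is built around, which is why no pipeline depth can hide those misses.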
Itanium 2 (6:27pm EST Fri Feb 10 2006)
This is certainly an outstanding article. Thanks for taking the time to illustrate pipelines so clearly.
If I could add: IA64's VLIW (Very Long Instruction Word) puts approximately 3 ops per word, and there are 4 parallel pipes, so 4 x 8 = 32, plus some allowance for faster decode of the instruction, call it 35. Shorter pipelines, simpler architectures, and more intelligent compilers are the way to go, which we all seem to have forgotten even though it's the fundamental core of RISC. Prescott, with 125M transistors, was just desperation stuff to try to get to 4GHz, which they abandoned due to the power consumption of those 31 stages. Update the table to 2006 to show the 1.6GHz IA64, with dual core now coming in April. VLIW and parallel pipelines are the way of the future according to all research (look at ATI/Nvidia constantly broadening geometry and raster engine stages, not lengthening them), while EMT64/AMD64 is desperately stretching a very old architecture with ugly 64-bit extensions … it can't last forever. Do you think you'll be computing with IA32/EMT64 in 2010? I don't think so …
- by Greg Edwards
RickGeek Critique (6:47pm EST Fri Feb 10 2006)
Thanks for the article Rick. That's more in-depth than I expect from Geek.com (and was a nice
101/Primer for me). I'm used to the Readers Digest types of articles that lead to the usual Fanboy
“AMD/INTEL, ATI/Nvida, Windoze/alt-OS, Great Taste/Less Filling”, etc types of flame wars.
I'm sure a few alleged experts will flame you for anything that doesn't meet their expectations, but I'm
married to an academic and see that kind of criticism all the time. Someone else always knows more.
stay tough and keep posting informative articles. - by Not sick/Not well
Greg Edwards (7:10pm EST Fri Feb 10 2006)
I forgot to state the obvious .. IA64 = Itanium2, sorry. - by Greg Edwards
Man … (7:47pm EST Fri Feb 10 2006)
101ed again. :)

But seriously, the “fundamental core of RISC” is the right pointer.
To add:
It's the system baby: Compiler and Hardware together that counts, there are
places for VLIW (e.g., regular, DSP type processing, DSPs are basically VLIW
processors). Often it takes a long time for the compilers to catch up in
development.

The fundamental limits are in the inherent topology of the code instructions,
that is where it becomes interesting.
Looks like a highly parallel architecture (10^12+ elements) with a 6-stage pipeline operating at 100W for everyday problems makes a nice target. :)
- by Brainy
Ultimate Goal of Pipelines (8:18pm EST Fri Feb 10 2006)
The ultimate goal is to increase the number of pipelines, through improvements in branching algorithms and electrical leakage.
Why are there people so literate in Computer Architecture, yet so ignorant of the purpose of pipelines?! - by let's be fair
Good Article (9:07pm EST Fri Feb 10 2006)
Keep them coming, always like learning new stuff. Definitely one of the best and most in-depth articles posted here at Geek.com - by SinfulSoul
Thanks for article (9:14pm EST Fri Feb 10 2006)
I am wondering, would multi-core allow shorter pipelines, given it is another path to parallelism? - by pp
re:Brainy (11:01pm EST Fri Feb 10 2006)
Pipe down retard. Grab your ipod and head to the mall with your leg warmers - by Not a sh*ithead
Hmmm (11:55pm EST Fri Feb 10 2006)
Top article! I like, especially, the fact that it is in simple English, and easy to understand.

I can see now why a deeply pipelined processor can go a lot faster than a shorter one, but how a shorter one can get more work done.
Sun has gone back in time with the T1: 5 pipeline stages overall. With AMD's 12 and the P3's (and probably the PM's) 10, those two should be close performers for quite some time.
- by Headley
Good stuff (12:07am EST Sat Feb 11 2006)
-The MUL opcode on modern x86-64 processors takes 3-8 cycles (depending on whether it's a 16-, 32-, or 64-bit multiply and whether the numbers are stored in registers or in memory).
-A move (MOV) from memory to a CPU register takes 3 clocks on a modern x86-64.

-Branch prediction algorithms can only go so far. For loops with moderate to
high loop counters they work fine.
E X A M P L E x86 32bit assembly
MOV EAX,1000 //cpu register = 1000
LABEL1: //(an address in the code)
… //run some code in the loop
SUB EAX,1 //subtract 1 from eax
TEST EAX,EAX //logical EAX AND EAX
JNZ LABEL1 // if EAX != 0 then LOOP
Branch prediction will be fine on the above code because it's easy to predict
that the loop counter will not be zero from previous iterations.
-The problem with optimizing is that everything needs to be taken into context with Out-of-Order execution.
In some cases a well-coded optimizing compiler works fine, but in others it fails miserably (especially when trying to make tasks parallel with SSE).
ANOTHER E X A M P L E

LBLNUM1:
dd 5088 //a 32bit integer in memory
LBLNUM2:
dd 11119 //a 32bit integer in memory
LBLRESULT:
dd ? //empty 32bits of memory
MULFunction1:
MOV EAX, [LBLNUM1] //EAX = 5088
MUL DWORD [LBLNUM2] //EDX:EAX = EAX * 11119
MOV [LBLRESULT], EAX // save answer
RET 0
MULFunction2:
MOV EDX, [LBLNUM2]
MOV EAX, [LBLNUM1]
MUL EDX
MOV [LBLRESULT], EAX //store answer
RET 0
OK, there are two functions that do the same thing. Func1 takes 12 clock cycles and Func2 takes 12 clock cycles. If you are JUST looking at the functions you'd pick Function1 because it's one less instruction, but if you can process the MOVs in Function2 in advance (separating them from the MUL instruction), Function2's method of doing it could be faster.
- by r22
Ahhh interesting… (4:22am EST Sat Feb 11 2006)
it takes 30 years for the MHz to increase a thousand-fold. - by BoFox
Mhz Chart.. design dates or release? (4:53am EST Sat Feb 11 2006)
That table.. it must be referring to the time that a particular design started on the drawing board, right? The 386 didn't hit the shelves for a consumer here (North America) until 1993. - by When
2 cent's (11:10am EST Sat Feb 11 2006)
In general, the speedup in completion rate versus a single-cycle implementation that's gained from
pipelining is ideally equal to the number of pipeline stages. A four-stage pipeline yields a four-fold
speedup in the completion rate versus single-cycle, a five-stage pipeline yields a five-fold speedup, a
twelve-stage pipeline yields a twelve-fold speedup, and so on. This speedup is possible because the
more pipeline stages there are in a processor, the more instructions the processor can work on
simultaneously and the more instructions it can complete in a given period of time. - by simple fi
Greg Edwards (12:06pm EST Sat Feb 11 2006)
I apologize for getting off topic, but your “what do you think you'll be using in 2010?” statement is wishful
thinking. Given how many people are still running Windows 98 in 2006 and the fact that x86-32 is just
about 20 years old and dominates the computer industry, I'd say it's a very safe bet that a large majority
of computer users will still be computing with x86-32 and particularly x86-64 in 2010. Do you honestly
think everybody's going to ditch their PC's for Itanium systems in the next four years? I wouldn't even give
you 2020. Heck, even Apple is x86 now, and Intel and AMD are pouring more money than ever into their
x86-64 chip lines. Conversely, rather than moving towards the mainstream, IA64 has done nothing but
move to the niche market of very high-end servers over the last 2-3 years. In 2003, the average unit price
for an Itanium server was $25K. That price has now tripled to ~$75K. -by Grover
Keep it coming! (1:15pm EST Sat Feb 11 2006)
Articles, like this one, are the reason I come to Geek.com. Computers are my hobby, not my job. I come

here to learn. - by Randomize timer
simple fi (2:00pm EST Sat Feb 11 2006)

That's not a perfectly straight-forward relationship. The primary purpose for
breaking out a workload into stages is so that each stage can be completed
more quickly. When you have a five-stage pipeline compared to a 12-stage
pipeline, each stage is doing more work in the five than in the twelve,
therefore it takes longer to physically process. As such, the 12-stage pipeline
can clock much higher because each stage is doing less. The downside is it
takes 12 clocks to get the first bit of work processed through.
Also, what you said isn't exactly true:
“This speedup is possible because the more pipeline stages there are in a
processor, the more instructions the processor can work on simultaneously
and the more instructions it can complete in a given period of time.”
The actual end-result of the additional clock speedup advantage allows your
statement to be true on certain types of code, but it's not the simple fact that
there are more things being done in parallel which make it faster overall. It's
the fact that more things are being done in parallel *AND* at a higher clock
rate that actually makes it faster.
In short, if you took a 12-stage processor and a 5-stage processor and
clocked them at the same speed, the 5-stage processor would do more work
in a given amount of time.
- by RickGeek
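RickGeek's trade-off can be sketched with the standard rule of thumb that each stage's delay is the total logic delay divided by the number of stages, plus a fixed flop/latch overhead per stage. The C snippet below is an illustration only; the 5ns logic delay and 0.1ns latch overhead are assumed values, not figures for any real chip.

#include <stdio.h>

/* Simple pipelining model:
 *   stage_delay = total_logic_delay / stages + latch_overhead
 *   max clock   = 1 / stage_delay
 * Throughput is roughly one instruction per cycle once the pipe is full,
 * ignoring stalls and branch mispredictions. */
static double max_clock_ghz(int stages, double logic_ns, double latch_ns)
{
    double stage_delay = logic_ns / stages + latch_ns;  /* nanoseconds */
    return 1.0 / stage_delay;                           /* GHz */
}

int main(void)
{
    double logic = 5.0;   /* assumed total combinational delay, ns */
    double latch = 0.10;  /* assumed per-stage flop overhead, ns   */

    printf(" 5 stages: max clock about %.2f GHz\n", max_clock_ghz(5,  logic, latch));
    printf("12 stages: max clock about %.2f GHz\n", max_clock_ghz(12, logic, latch));
    printf("31 stages: max clock about %.2f GHz\n", max_clock_ghz(31, logic, latch));
    return 0;
}

Deeper pipes buy clock rate, not more work per clock; hold the clock fixed and the shorter pipe wins, which is RickGeek's point above.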
What happens next is dependent on COMPETITION (5:39pm EST Sat Feb 11 2006)
No one wins but Intel if there is no competition and the consumer is the biggest loser.
The features that AMD is now adding should have been added a decade ago.
There are many types of IT workers and I admit to knowing nothing about IC design other than what I had
to in college 25 years ago. - by tech
hmmmm (8:30pm EST Sat Feb 11 2006)
Holy shit, you guys actually came up with an article on your own without taking it from somewhere else on
the web? Or did you take it from somewhere else hmmmmmm… - by …
POWER6 details…….. (10:41pm EST Sat Feb 11 2006)

Mainframes coming to a desktop near you……POWER6
- by Power2thePeople
WOW (3:38pm EST Sun Feb 12 2006)
so that means my Pentium 1 PC has the most efficient performance because it only has 5 pipelines!!!
so 10 pentium 1s will be waaaay faster than a dual processor northwood PC!!! - by Inte-L-egacy
Or 8 PIII, 1/4th the size with 1/2 the pipes, (4:43pm EST Sun Feb 12 2006)
than the dual cores. OoO would greatly improve performance as the number of cores increases as would
optimized cores like fpu, crypto, tcpip offloading …. - by tech
pipeline (5:49pm EST Sun Feb 12 2006)
if it's as big and as long as mine are, then it'll be good. It should be able to multiple process. - by truedat
Thank You! (8:05am EST Mon Feb 13 2006)

Thank you RickGeek, 2 Questions if you please:
1- Are there any chips out there that have three pipelines? You could hypothetically fill two pipelines, one with the “if yes” path and one with the “if no” path, and have the third one on standby. And then have a dispatch chip that says which one of the yes or no executes, which means the other two will start refilling the second branch while this one is executing. Think about it: it might be a costly or impossible design, but I can just imagine the superspeeds that could be attained even with modest clock cycles. Or am I completely wrong??
2- I am guessing there already are chips that are custom tailored to existing MPEG, MP3, and other processing-intensive file formats; if yes, which ones?
- by 2nd Op
2nd Op (6:40pm EST Mon Feb 13 2006)

“1- Are there any chips out there that have three pipelines?”
The number of simultaneous pipelines is different from the number of pipeline stages, and after reading your question I can see how you would ask it. I should update my article to make that point clear.
Within a logical pipeline are stages which carry out physical workloads in
parallel within the single stage. This allows multiple logic units, for example, to
operate on several instructions at the same time. That amount of parallel work
also requires that additional facilities be provided at each stage to carry out
those types of workloads. And, whereas certain stages do not have to be as
wide as the execution units that carry out the work (because the execution
units can take many clock cycles to complete a complex operation while the
pipeline feeding those execution units can be completed in many fewer clock
cycles), it is still a relatively straight-forward way of thinking to visualize the
entire pipeline as you suggest.
“2- I am guessing there already are chips that are custom tailored to existing MPEG, MP3, and other processing-intensive file formats; if yes, which ones?”
I doubt there are any chips outside of specialized add-on co-processors, or
dedicated micro-controllers which do those kinds of things. And even then, I
would suggest that they are actually running a tiny program inside of the core
itself that simply exposes a type of API to the other CPU which can be thought
of as those types of abilities.
Software is something that can be carried out logically as you suggest, and it's
far easier than making an entire silicon device whose sole purpose is to carry
out those kinds of variable workloads. And, for many embedded RISC cores, it
would probably be as fast, if not faster, considering the R&D money a
company would have to spend on that particular implementation, compared to
the general functionality and use you would get out of a regular CPU that is
not so specialized.
