( ESNUG 360 Item 15 ) -------------------------------------------- [11/02/00]
Subject: ( ESNUG 359 #6 ) Design A 64-bit+ Multiplier-Accumulator (MAC)
> I would love to hear from your readers how they'd design a large
> Multiplier-Accumulator (MAC) for over 64-bit operands. I'm considering
> Module Compiler. Our implementation technology is not decided yet, but
> I'm guessing 0.18um or smaller. We're targeting in excess of 133MHz.
>
> In terms of speed/power, pointers/numbers would be greatly appreciated,
> as also techniques to verify this type of circuit. (Obviously, we're not
> circuit designers, and would probably do a very poor job at a custom
> multiplier.) Ideas? Pointers?
>
> - Neel Das
From: Gil Herbeck <gilherbeck@home.com>
John,
There are a number of factors that can have a big influence on this design.
How big are the operands and the accumulator? Do you need saturation logic?
If so, are there multiple (programmable) saturation points? Do you have
both integer and fixed point, or just one data type? Can you have latency
from the inputs to the accumulator? Is your process / cell library
optimized for area and power? It's hard to say much without more specific
info.
If performance becomes a problem, MC can provide a big advantage for
non-interleaved accumulators. You may be able to leave the accumulator
itself in carry-save format and push the carry propagation to after the
accumulation register.
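
To illustrate the carry-save idea above, here is a minimal Python sketch (a behavioral model only, not MC output). The function names csa() and accumulate_carry_save() are hypothetical, and the model assumes unsigned, fixed-width words. The accumulator is kept as a separate sum word and carry word, so each accumulation step is only bitwise logic; the single carry-propagate addition happens once, at the end.

```python
def csa(a, b, c, width):
    """3:2 carry-save adder: compress three words into a sum word and a carry word.

    No carry ripples across bit positions here; each output bit depends
    only on the three input bits in the same (or adjacent) column.
    """
    mask = (1 << width) - 1
    s = (a ^ b ^ c) & mask                              # per-bit sum
    cy = (((a & b) | (a & c) | (b & c)) << 1) & mask    # per-bit carry, shifted left
    return s, cy


def accumulate_carry_save(addends, width):
    """Accumulate in carry-save form; resolve carries only once at the end."""
    mask = (1 << width) - 1
    acc_s, acc_c = 0, 0
    for x in addends:
        acc_s, acc_c = csa(acc_s, acc_c, x & mask, width)
    # The one carry-propagate addition, outside the accumulation loop.
    return (acc_s + acc_c) & mask


if __name__ == "__main__":
    values = [3, 5, 7, 250]
    print(accumulate_carry_save(values, 8), sum(values) & 0xFF)
```

Both printed values agree modulo 2^width, which is the property that lets the slow adder move out of the feedback loop.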
- Gil Herbeck
Radix20 Livermore, CA
---- ---- ---- ---- ---- ---- ----
From: [ A Synopsys Module Compiler CAE ]
Hi John,
It is fairly straightforward to implement a simple MAC in Module Compiler
(MC). You get full operator merging (a single carry-save reduction/Wallace
tree with just one carry-propagate adder for the entire multiply and add
operation, as well as any other addends), plus a choice of multiplier
(Booth/non-Booth) and final adder (fast-carry-lookahead, carry-lookahead,
carry-select, the Synopsys proprietary carry-lookahead-select, and ripple)
micro-architectures to trade off area against timing.
You can also parameterize these options along with the input operand widths
and different implementations of the MAC to perform fast architectural
exploration. This is shown in the first architecture (arch==0) of the
following piece of Module Compiler Language (MCL) code.
    module MAC (Z,X,Y,R,w,ovf,mult,fa,arch);
      integer w    = 64;         // Input width
      integer ovf  = 2;          // Overflow accum. bits
      integer accw = 2*w + ovf;  // Accumulator width
      integer arch = 0;          // MAC architecture
      string  mult = "booth";    // Multiplier type
      string  fa   = "cla";      // Final adder type

      directive(multtype=mult,fatype=fa,pipeline="off");

      input signed [1] R;        // Accumulator reset
      input  [w] X,Y;
      output [accw] Z;

      if (arch==0){
        wire [accw] ACCin = X*Y + (Z&R);
        Z = sreg(ACCin);
      }

      // arch-1 is also implemented by an MC built in function maccs()
      // arch-1 can be modified slightly to pipeline the multiplier
      // and the final adder to further speed up the MAC.
      if (arch==1){
        wire [accw] ACC0,ACC1,ACCin0,ACCin1;
        directive local (carrysave="convert");
        wire [accw] ACCin = X*Y + (ACC0&R) + (ACC1&R);
        csconvert(ACCin0,ACCin1,ACCin);
        ACC0 = sreg(ACCin0);
        ACC1 = sreg(ACCin1);
        Z = ACC0+ACC1;
      }
    endmodule
As all of us know, the critical path is from the inputs, thru' the merged
multiplier and propagate adder in the accumulator. You can individually
access the output of the accumulator "carry" and "sum" terms to 'push'
the final propagate adder out of the sequential feedback loop. This will
speed up the design and may be done by setting the carrysave attribute in
MC to "convert" and using the csconvert() function. That was shown in the
second architecture (arch==1).
o The second architecture can be easily modified to isolate the
multiplier, so that it can be pipelined and retimed by MC along
with the final adder. This will further speed up the design
without changing the basic functionality of the MAC.
o After synthesis, MC will write out a bit- and cycle-exact RTL
simulation model, in either Verilog or VHDL. This can be used for
running your fast functional simulations to verify your design.
Of course, you'll use the gate-level netlist for full simulation.
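
As one possible aid to verification, the two architectures can also be modeled behaviorally and compared cycle by cycle. The Python sketch below is not generated by MC; mac_arch0() and mac_arch1() are hypothetical names, operands are assumed unsigned, and R is read as "1 = keep accumulating, 0 = clear", following the (Z&R) masking in the MCL code. In real hardware the partial products are merged into the same reduction tree; here the product is formed first and then compressed with the two accumulator words, which gives the same result modulo 2^accw.

```python
import random


def mac_arch0(stream, w=64, ovf=2):
    """Arch 0: one accumulator register, carry-propagate add inside the loop."""
    mask = (1 << (2 * w + ovf)) - 1
    z = 0
    outputs = []
    for x, y, r in stream:
        z = (x * y + (z if r else 0)) & mask
        outputs.append(z)
    return outputs


def mac_arch1(stream, w=64, ovf=2):
    """Arch 1: accumulator held as two carry-save words (ACC0, ACC1)."""
    mask = (1 << (2 * w + ovf)) - 1
    acc0 = acc1 = 0
    outputs = []
    for x, y, r in stream:
        a0 = acc0 if r else 0
        a1 = acc1 if r else 0
        p = (x * y) & mask
        # 3:2 compression: only bitwise logic sits in the feedback loop.
        acc0 = (p ^ a0 ^ a1) & mask
        acc1 = (((p & a0) | (p & a1) | (a0 & a1)) << 1) & mask
        # Final carry-propagate add happens after the registers.
        outputs.append((acc0 + acc1) & mask)
    return outputs


def random_stream(n, w=64, seed=0):
    """Random (X, Y, R) stimulus; R is cleared roughly one cycle in eight."""
    rng = random.Random(seed)
    return [(rng.getrandbits(w), rng.getrandbits(w), int(rng.random() > 0.125))
            for _ in range(n)]


if __name__ == "__main__":
    stim = random_stream(1000)
    print("architectures match:", mac_arch0(stim) == mac_arch1(stim))
```

The same stimulus and expected outputs could be written to a file and replayed against the RTL simulation model or the gate-level netlist.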
To give you a flavor of what the results look like, I used the Synopsys
DesignWare Silicon Library (std. cell) developed for TSMC's 0.18G process
to run a couple of tests. This is for a 64-bit operand MAC without any
pipelining. Of course, your results will vary depending on the technology
library you use.
Arch-0: # of instances= 5910; delay= 7.58ns (~132 MHz)
Arch-1: # of instances= 6275; delay= 5.33ns (~188 MHz)
The above delay numbers can be reduced significantly by pipelining the
multiplier and the final propagate adder, until you hit the limit of
the loop delay, which then will be the critical path. Here are the results
for a pipelined and retimed MAC with 2 pipe stages in the multiplier and
one in the final carry propagate adder (for a total of 3 in the design):
Modified Arch-1: # of instances=8999; delay=3.37ns (~300 MHz)
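
The clock rates quoted in parentheses follow directly from the delays. The short Python sketch below redoes that arithmetic, assuming the reported delay is the whole clock period (no extra margin for clock skew or uncertainty); max_freq_mhz() is a hypothetical helper name.

```python
def max_freq_mhz(delay_ns):
    """Maximum clock frequency in MHz for a critical-path delay in ns."""
    return 1000.0 / delay_ns


# Delays as reported above for the 64-bit MAC.
delays_ns = {"Arch-0": 7.58, "Arch-1": 5.33, "Modified Arch-1": 3.37}

if __name__ == "__main__":
    for name, d in delays_ns.items():
        print(f"{name}: {d} ns -> {max_freq_mhz(d):.1f} MHz")
```

By this arithmetic the pipelined version comes out just under 300 MHz, and Arch-0 sits right at the 133 MHz target with essentially no margin.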
You may get more aggressive delays with smaller process tech. libraries, but
without changing the functionality of the MAC, you'll always be bound by
the feedback loop delay.
Hope this helps.
- [ A Synopsys Module Compiler CAE ]