Subject: Discovering "Attack Turkeys" At The Holliston Poor Farm
Two years ago, a Holliston suburban housewife took her 5-year-old son for
a walk on a road near my farm's swampy woodlands. They came across a
flock of half-wild/half-domesticated turkeys that lived in those woods.
The 5-year-old decided it would be fun to run after the flock and suddenly
he encountered a fairly large (compared to the boy) male turkey bravely
defending his hens. Although the boy wasn't hurt at all, the shocked
housewife called the police to report "an attacking turkey." It being a
slow news day, the radio dispatch telling the police to investigate
"attack turkeys" not only brought police, but also a local TV news crew!
Finding that the turkeys did nothing unless you chased them, the police
left. The TV reporters then chased the turkeys around until they got
enough "attacks" on video to make for a good story, and then also left.
- John Cooley
the ESNUG guy
Next Week: (Part 2) "Attack Turkeys" & Massachusetts Tax Dollars At Work
( ESNUG 249 Item 1 ) -------------------------------------------- [8/96]
Subject: ( ESNUG 247 #3 248 #3) Benchmark & Opinions On Hardware Emulators
From: [ Synopsys Arkos R&D ]
> FYI: Quickturn gate capacity tends to be overstated from our experience;
> "emulation gates" are typically 2-3X smaller than the design gates that
> Design Compiler would tell you about. I think experienced Quickturn users
> understand this distinction, but it does tend to blur comparisons of
> capacity.
From: Don Monroe <Don_Monroe@synnet.com>
> The last sentence of [ Call Me Ishmael ]'s critique of emulation systems
> suggested that emulation is usually 6X faster than simulation. In my
> experience (4 to 5 yrs using Pie/Quickturn) I would say that emulation
> approaches 1 million X of simulation! If he's only getting 6X he's doing
> something wrong.
From: naeem@quickturn.com (Naeem Zafar)
Hi John,
1) Real gates vs. "emulation gates" -- unlike the comment someone made,
emulation gates are NOT 2-3X smaller than the real ASIC gates -- the actual
gate capacity depends on the design style -- much like in an ASIC.
The emulation gate was designed to be a unisex unit of measure which includes a
certain combination of memory bits and logic gates. If one has 64-bit
buses and multi-port RAMs, the actual capacity of a 250K emulation gate
Quickturn machine may be only ~130K ASIC gates; but if one has 32-bit
buses, mostly single-port RAM, and mostly flip-flops, the capacity can
be higher, in some cases closer to 250K actual gates. This issue of
predicting exact capacity is not unique to Quickturn. No vendor in the
emulation business can predict exact capacity, and one tends to be more
optimistic until one has a few arrows in one's back.
2) Like Don Monroe says, if someone is getting only a 6X speedup with
emulation, they are doing something wrong -- Quickturn customers typically
get 100,000 to 1,000,000 times speedup and this speed allows users to
validate software and the chips in the context of the real application.
3) Quickturn can handle designs at the RTL level (Verilog and VHDL) in addition
to the gate level.
4) And yes, we do believe in great DAC parties! :^)
- Naeem Zafar
Quickturn
( ESNUG 249 Item 2 ) -------------------------------------------- [8/96]
Subject: ( ESNUG 248 #8 ) Handling Timing Paths Through A Bi-Directional Bus
> I have registers with paths through logic, then through a tri-state driver,
> and then out to a bidirectional bus. Data can also be brought in from the
> bus to the same registers but not on the same bus cycle. I need to disable
> the timing path which loops from the flip-flop Q out through the bus
> interface and back to the flip-flop D input without disabling other paths
> with the same end-points. I still need the timing to be checked between
> the data bus and the flip-flop.
> Bidirectional Data Bus
> ______________________ ======================
> | ___ ^ Enable ___ ^
> V______| | ____ | __ | | __
> ______|MUX|__| | | ,-~ \__ |\| | |\ ,-~ \__
> ^ |___| | |___|__( Logic }____| \___V___| \___( Logic }__
> | | | FF | | \ _/ | / | / \ _/ |
> | Select_| ck-|> | | ~---' |/ |/ /~---' |
> | |____| V__________________________________/ |
> |___________________________________________________________________V
>
> Has anyone come up with a good solution to this type of timing problem?
From: Ajay Nath <nath@ecpi.com>
John, I figured it was time I gave back to ESNUG.
This is a classic case for path segmentation. Path segmentation is useful
when there are multiple paths between two points and some of them are
multicycle. It is preferable to path disabling because you still
specify timing requirements. If set_multicycle_path or set_false_path had
a '-through' option, it would solve a lot of problems like these.
To perform path segmentation you must break the combinational path (at
the bidirectional data bus, in this case). When this is done, the break
in the path creates a new end point, and a new starting point. Timing to
(from) this end (starting) point must be specified in order to meet path
timing requirements. But there is no through path anymore.
So, if I have interpreted Greg's problem correctly, we would end up with:
Bidirectional Data Bus
_____________________ ======================
| ___ ^ Enable ___ ^
V______| | ____ | __ | | __
______|MUX|__| | | ,-~ \__ |\| | |\ ,-~ \__
^ |___| | |__|__( Logic }____| \__A B_V___| \___( Logic }_
| | | FF | | \ _/ | / | / \ _/ |
| Select_| ck-|> | | ~---' |/ |/ /~---' |
| |____| V_____________________________________/ |
|_____________________________________________________________________V
Where A and B are the new end and starting points. You would then
specify the required setup/hold output requirements at A, and the input
delays at B, to complete the path timing picture. What this process has
done is break the path through the data bus. Now you have
the true paths [FF to FF through logic, FF to databus, and databus to FF]
separated, and can optimize each accordingly.
For a 32-bit data bus, assuming no other driver on the bidirectional data
bus in this module, and with the variable 'databus' holding a list of all data bits:
foreach (databusnet, databus) {
remove_variable driver_pin
driver_pin = filter (all_connected(databusnet), "@pin_direction==out")
set_input_delay ... [to driver_pin]
set_output_delay ... [from driver_pin]
:
set_false_path .. [if needed]
}
NOTE: The above methodology is to be applied when the design has been
mapped. This cannot be applied on an unmapped design, since this
requires that the driver pin be a valid mapped cell pin.
You have to specify timing to A and from B, and you should be able to get
these numbers from the mapped design if the databus is an internal signal
rather than a constrained I/O. Note that set_input_delay/set_output_delay
place an implicit dont_touch attribute on the leaf cells associated with
the start or endpoint of a segmented path, so in the above example a
dont_touch will be placed on the bus driver.
- Ajay Nath
Motorola
---- ---- ---- ---- ---- ---- ---- ----
From: Oren Rubinstein <oren@waterloo.hp.com>
Hello John.
The solution is called "path segmentation". You apply a pair of
set_input_delay and set_output_delay on the same point, which creates a
new startpoint and endpoint in the middle of the path. In this case, the
correct point to apply the commands to is the output of the tristate driver.
Unfortunately, applying the constraints also applies an implicit
dont_touch on the tristate driver, and that prevents DC from mapping it.
Which means you have to compile twice: the first time to map it,
then you do the segmentation, then an incremental compile.
- Oren Rubinstein
Hewlett-Packard (Canada) Ltd.
---- ---- ---- ---- ---- ---- ---- ----
From: ryan@fsd.com (Ken Ryan)
John,
We had nearly the same situation as Greg. What you need is "set_false_path
-through", which of course doesn't exist. We spent considerable time
looking for a way to get the same effect, without finding one.
Here are two things we did to take care of it:
1) Path segmentation. There's a methodology note on it in the online docs,
which is pretty good. The node where the bidi bus connects to the block
can be segmented with set_input_delay and set_output_delay. The downside
is having to maintain consistent timing budgets, and adjust them as
optimization warrants. If you have a *lot* of paths like this, it can
get out of hand quickly.
2) Nuke the bidi buses. That's what we eventually did. We were able to
make a floorplan that contained the bidi busses in a relatively
small area, so we restructured it to use muxes instead of tristate
busses. It ended up much cleaner, and as a bonus the time to do
update_timing went down by a factor of about 8.
- Ken Ryan - Glenn Griffin
Orbital Sciences Corp. Intrinsix Corp.
( ESNUG 249 Item 3 ) -------------------------------------------- [8/96]
From: steedman@pmcmac.pmc-sierra.bc.ca (Richard Steedman)
Subject: Huh? Design Compiler Puts In Registers With D Inputs Tied Low???
John,
In one of my designs, I have a 32-bit registered output, the top 16 bits of
which happen to be zero, e.g.:
IF clk'EVENT AND clk = '1' THEN
IF cond THEN
output <= "0000000000000000" & val_a;
ELSE
output <= "0000000000000000" & val_b;
END IF;
END IF;
Synopsys seems to insist on synthesising 16 registers with inputs tied to
ground for the top 16 bits. Is there any way to get Synopsys to blow the
registers away i.e. have 'output(31 DOWNTO 16)' directly tied to ground?
(Incremental compiles do not seem to help.)
- Richard Steedman
PMC-Sierra, Inc.
( ESNUG 249 Item 4 ) -------------------------------------------- [8/96]
Subject: ( ESNUG 248 #4 ) Intermittent VHDL 3.3a Elaboration w/ Multiple Libs
> We have experienced intermittent problems with the following design
> library setup. ... The "elaborate" command occasionally fails
> to link in a component or two from a separate, but correctly
> referenced, design library. ... Unfortunately, every time we try to
> extract a small test case from our environment the problem evaporates.
> Often the problem disappears after deleting all analyzed files and
> re-analyzing everything. We've just been unable to isolate the factor(s)
> which trigger this problem. ... We'd like to know if anyone else has
> experienced this and if they have a better idea of the cause. Our only
> workaround is to throw everything into the WORK library.
From: Victor Preis <Victor.Preis@zfe.siemens.de>
Hello, John,
We've had the same experience with multiple libraries in versions 3.3b and
3.4b. The only solution is to analyze the entity and architecture in one
(!) analyze call, either by:
- putting both in one vhdl file
or
- calling dc_shell> analyze ... { entity architecture }
It looks like SYNOPSYS loses some information when analyzing the
entity and architecture in separate calls. One other observation on
multiple libraries: SYNOPSYS has a different LIBRARY concept than VHDL.
For SYNOPSYS the WORK library means: RAM, WORK, and everything that is
visible on the search path. You will get a warning in the following set-up:
1.) Design A contains two instantiations of design B.
2.) A is in Library WORK, B is in Library XY.
During the elaboration, the first instantiation of B is elaborated without
any problems. The elaboration of the second instantiation produces the
following message:
Warning: Instance of 'B' is defined in both libraries
'XY' and 'WORK', which are both visible in design 'A'.
The first library will be used. (LINK-14)
This is very confusing. You have to switch the semantics of the VHDL library
concept with each new tool (simulation vs. synthesis). You have no control
over which design is actually read. And it doesn't save on elaboration time
(i.e. it elaborates identical components from the same library over and
over again!!)
- Victor Preis
Siemens AG, R&D
( ESNUG 249 Item 5 ) -------------------------------------------- [8/96]
From: ryan@fsd.com (Ken Ryan)
Subject: HEEEEEELP! -- I've Got CLI or SWIFT or VSS Memory Leaks!
Hey, John!
I'm encountering a mysterious memory leak in my simulation environment.
My testbench in VSS uses RTL VHDL, SmartModels, and the CLI C interface.
I'm running Sun/SunOS4.1.4. There is a memory leak somewhere -- the
simulator process starts out with a size of ~100 megs, but after running
overnight (about 15ms with a 1ns timebase and a 30ns system clock) the
process grows to over 600 megabytes! Nowhere do I use dynamically allocated
variables (VHDL or C) except the instance structure in CLI:
- The structure is allocated in the elaborate routine
by calling cliAllocMem().
- The structure is stored at the end of the elaborate routine
by calling cliSaveInstanceData().
- The structure is restored at the beginning of the evaluate routine by
calling cliRestoreInstanceData().
The evaluate routine is called once per system clock. I'm using the SWIFT
version R40b SmartModels, the models are upd4218160 and tms55161. The VSS
version is v3.4a and v3.4b (tried both). I checked the SOLVIT database and
saw the note about the TEXTIO package. I use it in one spot, but only for
writing lines to the screen and to a disk file. I checked over my own C
code pretty carefully, but I don't see where a leak might be. I'm not
ruling this one out, though.
Does anyone know of any memory problems with VSS or SmartModels or CLI
function calls? I'm out of ideas for where to look.
- Ken Ryan
Orbital Sciences Corp.
( ESNUG 249 Item 6 ) -------------------------------------------- [8/96]
Subject: ( ESNUG 248 #5 ) FSM Treatment Doesn't Seem Consistent Or Coherent
>I am busy with state machine synthesis and I have been fighting with design
>compiler some days in order to get the last picoseconds out of my FSM, that
>is supposed to reach 66 MHz. ... I decided to check Design Compiler using
>an identical FSM coding by swapping the columns of my state coding. I was
>sure to get an identical result, where the synthesized flipflops
>(SIG_st_sm_next_reg[0][1][2][3]) just changed their order.
>
> attribute ENUM_ENCODING of sm_next_state_type : type is
> -- order ABCD
> -- "0011 0010 0001 1010 0000 0100 0101 1101 1100 1011 1111 1001";
> -- order ADBC
> "0101 0001 0100 1001 0000 0010 0110 1110 1010 1101 1111 1100";
>
>I am very unhappy to see that the IDENTICAL synthesis script on an
>IDENTICALLY coded state machine produces DIFFERENT results! ...
>I understand that I get quite different results when using different state
>assignments. Do you know an explanation, why the results also differ, if
>I change only the ORDER of my state vectors?
From: [ Synopsys HDL Advisor Technical Marketing ]
Dear John,
Although I cannot provide an explanation of the state machine compiler's
results, I can sympathize. The starting point for synthesis is *extremely*
important. In fact, the Design Compiler optimization is limited by this
starting point. This means that in many situations it is more efficient to
change the source code. You get a better result and it may even take less
time than iterating with constraints.
Last year we released a new tool, HDL Advisor, which provides metrics
about the starting point for synthesis. In many cases we have adjusted
the source to take into account critical paths (late arriving signals)
and made significant gains in results. Since you are already spending
quite a bit of time trying to meet timing, it may make sense to
see what the tool says about your source.
- [ Synopsys HDL Advisor Technical Marketing ]
---- ---- ---- ---- ---- ---- ---- ----
---- ---- ---- ---- ---- ---- ---- ----
[ Editor's Note: What follows is an exchange of e-mails between Ted
Boydston and Dieter Peer on the synthesizing FSMs problem. - John ]
From: tboydsto@su102s.ess.harris.com (Ted Boydston)
To: peer@iis.fhg.de (Dieter Peer)
Dieter,
If I understood the question, what you did by changing flip flops ABCD to
flip flops ADBC is to change the sequence the machine counts. So, yes,
order does make a difference, because the transfer function that generated
the next-state transitions had to change. That is, a state machine with
two different state vector series must have two different next-state
transfer functions to generate the output, hence different timing parameters.
A simple example would help here:
Let's say you are making a 16-state state machine. You could use a 4 bit
binary counter as your state bits or a 4 bit linear feedback shift register
(LFSR) as your state bits. Since the counter generates a binary sequence,
while the LFSR generates a pseudorandom sequence, the binary counter is
inherently much slower than the LFSR. So, to obtain maximum performance in
this simple case, you would choose the LFSR.
By the way, your encoding method is probably not the best encoding if speed
is a concern. My recommendation, if area is not a concern, is to use one-hot
(or one-cold) encoding in Synopsys's FSM compiler -- this will give you the
highest performance. An example of one-hot in an 8-state state machine:
00000001
00000010
00000100
00001000
00010000
00100000
01000000
10000000
As you can see, each state is assigned its own flip flop, allowing for
minimal decode delay. If area is a concern, then Gray codes are the most
efficient. An example of a Gray code in an 8-state state machine would be:
001
011
111
110
100
000
100
110
In this case, only one bit is changing at any one time, allowing for minimal
decode delay.
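For reference, the standard reflected-binary Gray code can be generated
with one XOR per code word (a Python sketch of mine, not quoted from the
exchange); adjacent codes, including the wrap-around, differ in exactly
one bit, which is the property that keeps next-state decode shallow:

```python
def gray_code(n):
    """Standard reflected-binary Gray code for n bits.

    Code word i is i XOR (i >> 1); adjacent codes (including the
    wrap-around from last back to first) differ in exactly one bit.
    """
    return [i ^ (i >> 1) for i in range(2 ** n)]
```

For 3 bits this yields 000, 001, 011, 010, 110, 111, 101, 100 -- a cycle
visiting every state exactly once with a single bit change per step.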
- Theodore L. Boydston IV
Harris Corporation
---- ---- ---- ---- ---- ---- ---- ----
To: tboydsto@su102s.ess.harris.com (Ted Boydston)
From: peer@iis.fhg.de (Dieter Peer)
Ted,
Your answer is correct, but my question was different. (Sorry I am not a
native English speaker.)
Using one-hot for my 12-state state machine does *not* give the best (max
speed, any area) results. The gate library (ES2 Gate Array) provides at
most 4-input gates (and 6-input complex gates). DC synthesizes too many
logic gates at the outputs of the (here: 12) flipflops, so the time through
the logic gates within the FSM limits the speed. My best results for speed
were achieved using 4- or 5-bit state assignments.
Let's assume your 8-state assignment example using one-hot, naming the
flipflops A,B,C,D,E,F,G,H. If you swap the assignment by changing some of
the columns to, let's say, C,D,E,F,G,H,A,B, you get "another" one-hot. But
wouldn't you expect to get identical circuitry synthesized?
ABCDEFGH CDEFGHAB
00000001 00000100
00000010 00001000
00000100 00010000
00001000 00100000
00010000 01000000
00100000 10000000
01000000 00000001
10000000 00000010
I would expect to get a (functionally) identical netlist, where only the
*names* of the flipflops are swapped. (A->C, B->D, .... F->H, G->A, H->B)
If you do not agree with me, do you have some advice on how to optimize the
order of 1s in a one-hot assignment (or the Gray code assignment) for the
fastest possible circuit? The same would apply to your Gray-code
assignment (ABC swapped to ACB):
ABC ACB
001 010
011 011
111 111
110 101
100 100
000 000
100 100
110 101
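Dieter's column swap is just a permutation of bit positions; a small Python
sketch (mine, with hypothetical helper names, not from the exchange) shows
that permuting the columns of a one-hot assignment yields the same set of
codes, i.e. nothing more than a relabeling of which flop is hot:

```python
def swap_columns(codes, perm):
    """Apply a column permutation to equal-width binary code strings.

    perm[i] gives the source column for output column i, so
    perm = [2, 3, 4, 5, 6, 7, 0, 1] turns order ABCDEFGH into CDEFGHAB.
    """
    return ["".join(code[p] for p in perm) for code in codes]

one_hot = ["00000001", "00000010", "00000100", "00001000",
           "00010000", "00100000", "01000000", "10000000"]

# ABCDEFGH -> CDEFGHAB, matching the table above
swapped = swap_columns(one_hot, [2, 3, 4, 5, 6, 7, 0, 1])
```

The swapped assignment is still one-hot and uses the same set of codes, so
one would expect an identical netlist up to flip-flop names.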
- Dieter Peer
Fraunhofer Gesellschaft, Germany
---- ---- ---- ---- ---- ---- ---- ----
From: tboydsto@su102s.ess.harris.com (Ted Boydston)
To: peer@iis.fhg.de (Dieter Peer)
Dieter,
I now see what you are saying, and for your specific column swap example on
the one-hot case I would agree, because you are not changing the actual
transition pattern (explained later). For the Gray code case, though, I
believe there is a problem. In this case, you performed the following
column swap:
ABC ACB Transition
000 000 1
001 010 2
011 011 3
111 111 4
110 101 5
100 100 6
110 101 7
010 001 8
If you notice, there are very few similar transitions between case ABC and
case ACB (like 3->4), while non-similar transitions dominate (1->2 and 4->5).
Since the machine "counts" a different pattern, I believe that different
control logic will be synthesized to produce the desired pattern, resulting
in a slight performance difference.
Another example would be a straight binary counter with a swap:
ABC ACB Transition
000 000 1
001 010 2
010 001 3
011 011 4
100 100 5
101 110 6
110 101 7
111 111 8
In this case, you can see that case ABC would synthesize to an adder and
control logic, while case ACB would synthesize to some random-logic state
machine, resulting in a significant performance difference.
Your one hot case of:
ABCDEFGH Transition CDEFGHAB Transition
00000001 1 <- reset 00000100 3 <- reset
00000010 2 00001000 4
00000100 3 00010000 5
00001000 4 00100000 6
00010000 5 01000000 7 Transition order
00100000 6 10000000 8 does not change,
01000000 7 00000001 1 its only shifted.
10000000 8 00000010 2
is a special case because the transitions are exactly the same -- just
shifted -- and I would hope to see the same logic: a shift register. :^)
Now, let's say you changed the order of the machine from ABCDEFGH to
HACFGEBD; the machine would no longer transition in the order of a shift
register, resulting in a different circuit. By the way, I think if you
only shift the order of the state bits as a set, without swapping them, you
should achieve the same results with minimal performance impact in any
circuit.
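The shift-register structure Ted expects for an in-order one-hot FSM can be
sketched as a ring counter whose next-state function simply rotates the hot
bit (a minimal Python sketch of mine, under that assumption):

```python
def one_hot_next(state, width=8):
    """Next state of a one-hot ring counter: rotate the hot bit left.

    This is the shift-register next-state function one would expect
    synthesis to find when one-hot states are visited in order.
    """
    return ((state << 1) | (state >> (width - 1))) & ((1 << width) - 1)
```

Eight applications starting from 00000001 cycle through all eight one-hot
states and return to the start; no combinational decode wider than a wire
is needed between flops.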
As for your question about increasing performance, my first post still
holds for one-hot and Gray codes. For one-hot, your Mealy or Moore output
should be the state, and since there is one flip flop per state, you should
not need wide combinatorial decodes to get valid state transitions. Take
your circuit into the FSM Compiler and perform a one-hot synthesis -- that
should give you the fastest output. You will need to modify your HDL a bit
to get it into FSM Compiler, but overall you should not have
many problems. I hope that you can find a solution!
- Theodore L. Boydston IV
Harris Corporation
( ESNUG 249 Item 7 ) -------------------------------------------- [8/96]
From: wendyl@brooktree.com (wendy liu)
Subject: What's The Best Way To Synthesize Multipliers In Synopsys?
Hi John,
Could you share your multiplier design experience using RTL Verilog code and
Synopsys? I know "X=A*B;" doesn't work well. I also know that there are
people who hand instantiate Booth decoders and adders. What is the most
efficient and flexible way to synthesize a multiplier?
- Wendy Liu
Brooktree
( ESNUG 249 Item 8 ) -------------------------------------------- [8/96]
From: dmullen@net.com (David Mullenex)
Subject: Correlating A Mapped Xilinx Design to HDL Advisor 3.4b GTECH
Hi John,
Thanks for running the ESNUG, providing a forum for discussing and sharing
experiences with the $ynop$y$ tool set.
As part of my Xilinx XC4000 FPGA design process, I was using HDL Advisor to
get a pre-mapped prediction of the results my source code would produce. I
used info provided by the HDL Advisor to go back and refine my source code.
Now that I have synthesized my design I would like to take the mapped design
back to HDL Advisor and try correlating the pre-mapped predictions with the
mapped results. Obviously, the Xilinx FPGA architecture and designware make
it hard to correlate back to the Advisor GTECH (Boolean Generic Technology).
I was hoping to use this approach with two objectives. The first is to
simplify and somewhat automate the correlation between my mapped design and
my source code. The second is to get a feel for the relationship between
HDL Advisor GTECH predictions (using only two-input logic functions) and
the Xilinx FPGA mapped design.
Has anyone tried this approach successfully? Is this function supported by
HDL Advisor? (Its GUI seems to indicate that it can be done.)
- David Mullenex
Network Equipment Technologies