Subject: Discovering "Attack Turkeys" At The Holliston Poor Farm
Two years ago, a Holliston suburban housewife took her 5-year-old son for
a walk on a road near my farm's swampy woodlands. They came across a
flock of half-wild/half-domesticated turkeys that lived in those woods.
The 5-year-old decided it would be fun to run after the flock and suddenly
he encountered a fairly large (compared to the boy) male turkey bravely
defending his hens. Although the boy wasn't hurt at all, the shocked
housewife called the police to report "an attacking turkey." It being a
slow news day, the radio dispatch telling the police to investigate
"attack turkeys" not only brought police, but also a local TV news crew!
Finding that the turkeys did nothing unless you chased them, the police
left. The TV reporters then chased the turkeys around until they got
enough "attacks" on video to make for a good story, and then also left.
- John Cooley
the ESNUG guy
Next Week: (Part 2) "Attack Turkeys" & Massachusetts Tax Dollars At Work
( ESNUG 249 Item 1 ) -------------------------------------------- [8/96]
Subject: ( ESNUG 247 #3 248 #3) Benchmark & Opinions On Hardware Emulators
From: [ Synopsys Arkos R&D ]
> FYI: Quickturn gate capacity tends to be overstated from our experience;
> "emulation gates" are typically 2-3X smaller than the design gates that
> Design Compiler would tell you about. I think experienced Quickturn users
> understand this distinction, but it does tend to blur comparisons of
> capacity.
From: Don Monroe <Don_Monroe@synnet.com>
> The last sentence of [ Call Me Ishmael ]'s critique of emulation systems
> suggested that emulation is usually 6X faster than simulation. In my
> experience (4 to 5 yrs using Pie/Quickturn) I would say that emulation
> approaches 1 million X of simulation! If he's only getting 6X he's doing
> something wrong.
From: naeem@quickturn.com (Naeem Zafar)
Hi John,
1) Real gates vs. "emulation gates" -- unlike the comment someone made,
emulation gates are NOT 2-3X smaller than the real ASIC gates -- the actual
gate capacity depends on the design style -- much like in an ASIC.
The emulation gate was designed to be a unisex unit of measure which includes a
certain combination of memory bits and logic gates. If one has 64-bit
buses and multi-port RAMs, the actual capacity of a 250K emulation gate
Quickturn machine may be only ~130K ASIC gates; but if one has 32-bit
buses, mostly single-port RAM, and mostly flip-flops, the capacity can
be higher, in some cases closer to 250K actual gates. This issue of
predicting exact capacity is not unique to Quickturn. No vendor in the
emulation business can predict exact capacity, and one tends to be more
optimistic until one has a few arrows in one's back.
2) Like Don Monroe says, if someone is getting only a 6X speedup with
emulation, they are doing something wrong -- Quickturn customers typically
get 100,000 to 1,000,000 times speedup and this speed allows users to
validate software and the chips in the context of the real application.
3) Quickturn can handle designs at the RTL level (Verilog and VHDL) in addition
to the gate level.
4) And yes, we do believe in great DAC parties! :^)
- Naeem Zafar
Quickturn
( ESNUG 249 Item 2 ) -------------------------------------------- [8/96]
Subject: ( ESNUG 248 #8 ) Handling Timing Paths Through A Bi-Directional Bus
> I have registers with paths through logic, then through a tri-state driver,
> and then out to a bidirectional bus. Data can also be brought in from the
> bus to the same registers but not on the same bus cycle. I need to disable
> the timing path which loops from the flip-flop Q out through the bus
> interface and back to the flip-flop D input without disabling other paths
> with the same end-points. I still need the timing to be checked between
> the data bus and the flip-flop.
> Bidirectional Data Bus
> ______________________ ======================
> | ___ ^ Enable ___ ^
> V______| | ____ | __ | | __
> ______|MUX|__| | | ,-~ \__ |\| | |\ ,-~ \__
> ^ |___| | |___|__( Logic }____| \___V___| \___( Logic }__
> | | | FF | | \ _/ | / | / \ _/ |
> | Select_| ck-|> | | ~---' |/ |/ /~---' |
> | |____| V__________________________________/ |
> |___________________________________________________________________V
>
> Has anyone come up with a good solution to this type of timing problem?
From: Ajay Nath <nath@ecpi.com>
John, I figured it was time I gave back to ESNUG.
This is a classic case for path segmentation. Path segmentation is useful
when there are multiple paths between two points and some of them are
multicycle. It is preferable to path disabling because you still
specify timing requirements. If set_multicycle_path or set_false_path had
a '-through' option, it would solve a lot of problems like these.
To perform path segmentation you must break the combinational path (at
the bidirectional data bus, in this case). When this is done, the break
in the path creates a new end point, and a new starting point. Timing to
(from) this end (starting) point must be specified in order to meet path
timing requirements. But there is no through path anymore.
So, if I have interpreted Greg's problem correctly, we would end up with:
Bidirectional Data Bus
_____________________ ======================
| ___ ^ Enable ___ ^
V______| | ____ | __ | | __
______|MUX|__| | | ,-~ \__ |\| | |\ ,-~ \__
^ |___| | |__|__( Logic }____| \__A B_V___| \___( Logic }_
| | | FF | | \ _/ | / | / \ _/ |
| Select_| ck-|> | | ~---' |/ |/ /~---' |
| |____| V_____________________________________/ |
|_____________________________________________________________________V
Where A and B are the new end and starting points. You would then
specify the required setup/hold output requirements at A, and the input
delays at B, to complete the path timing picture. What this process has
done is break the path through the data bus. Now you have
the true paths [FF to FF through logic, FF to databus, and databus to FF]
separated, and can optimize each accordingly.
For a 32-bit data bus, assuming no other driver on the bidirectional data
bus in this module, and with the variable 'databus' holding a list of all data bits:
foreach (databusnet, databus) {
remove_variable driver_pin
driver_pin = filter (all_connected(databusnet), "@pin_direction==out")
set_input_delay ... [to driver_pin]
set_output_delay ... [from driver_pin]
:
set_false_path .. [if needed]
}
NOTE: The above methodology is to be applied when the design has been
mapped. This cannot be applied on an unmapped design, since this
requires that the driver pin be a valid mapped cell pin.
You have to specify timing to A and from B, and you should be able to get
these numbers from the mapped design if the databus is an internal signal
rather than a constrained I/O. Note that set_input_delay/set_output_delay
place an implicit dont_touch attribute on the leaf cells associated with
the start or endpoint of a segmented path, so in the above example a
dont_touch will be placed on the bus driver.
- Ajay Nath
Motorola
---- ---- ---- ---- ---- ---- ---- ----
From: Oren Rubinstein <oren@waterloo.hp.com>
Hello John.
The solution is called "path segmentation". You apply a pair of
set_input_delay and set_output_delay on the same point, which creates a
new startpoint and endpoint in the middle of the path. In this case, the
correct point to apply the commands to is the output of the tristate driver.
Unfortunately, applying the constraints also applies an implicit
dont_touch on the tristate driver, and that prevents DC from mapping it.
Which means you have to compile twice: the first time to map it,
then you do the segmentation, then an incremental compile.
- Oren Rubinstein
Hewlett-Packard (Canada) Ltd.
---- ---- ---- ---- ---- ---- ---- ----
From: ryan@fsd.com (Ken Ryan)
John,
We had nearly the same situation as Greg. What you need is "set_false_path
-through", which of course doesn't exist. We spent considerable time
looking for a way to get the same effect, without finding one.
Here are two things we did to take care of it:
1) Path segmentation. There's a methodology note on it in the online docs,
which is pretty good. The node where the bidi bus connects to the block
can be segmented with set_input_delay and set_output_delay. The downside
is having to maintain consistent timing budgets, and adjust them as
optimization warrants. If you have a *lot* of paths like this, it can
get out of hand quickly.
2) Nuke the bidi buses. That's what we eventually did. We were able to
make a floorplan that contained the bidi busses in a relatively
small area, so we restructured it to use muxes instead of tristate
busses. It ended up much cleaner, and as a bonus the time to do
update_timing went down by a factor of about 8.
- Ken Ryan - Glenn Griffin
Orbital Sciences Corp. Intrinsix Corp.
( ESNUG 249 Item 3 ) -------------------------------------------- [8/96]
From: steedman@pmcmac.pmc-sierra.bc.ca (Richard Steedman)
Subject: Huh? Design Compiler Puts In Registers With D Inputs Tied Low???
John,
In one of my designs, I have a 32-bit registered output, the top 16 bits of
which happen to be zero, e.g.:
IF clk'EVENT AND clk = '1' THEN
IF cond THEN
output <= "0000000000000000" & val_a;
ELSE
output <= "0000000000000000" & val_b;
END IF;
END IF;
Synopsys seems to insist on synthesising 16 registers with inputs tied to
ground for the top 16 bits. Is there any way to get Synopsys to blow the
registers away i.e. have 'output(31 DOWNTO 16)' directly tied to ground?
(Incremental compiles do not seem to help.)
- Richard Steedman
PMC-Sierra, Inc.
( ESNUG 249 Item 4 ) -------------------------------------------- [8/96]
Subject: ( ESNUG 248 #4 ) Intermittent VHDL 3.3a Elaboration w/ Multiple Libs
> We have experienced intermittent problems with the following design
> library setup. ... The "elaborate" command occasionally fails
> to link in a component or two from a separate, but correctly
> referenced, design library. ... Unfortunately, every time we try to
> extract a small test case from our environment the problem evaporates.
> Often the problem disappears after deleting all analyzed files and
> re-analyzing everything. We've just been unable to isolate the factor(s)
> which trigger this problem. ... We'd like to know if anyone else has
> experienced this and if they have a better idea of the cause. Our only
> workaround is to throw everything into the WORK library.
From: Victor Preis <Victor.Preis@zfe.siemens.de>
Hello, John,
We've had the same experience with multiple libraries in versions 3.3b and
3.4b. The only solution is to analyze the entity and architecture in one
(!) analyze call, either by:
- putting both in one vhdl file
or
- calling dc_shell> analyze ... { entity architecture }
It looks like SYNOPSYS loses some information when analyzing the
entity and architecture in separate calls. One other observation on
multiple libraries: SYNOPSYS has a different LIBRARY concept than VHDL.
For SYNOPSYS the WORK library means: RAM, WORK, and everything that is
visible on the search path. You will get a warning in the following set-up:
1.) Design A contains two instantiations of design B.
2.) A is in Library WORK, B is in Library XY.
During the elaboration, the first instantiation of B is elaborated without
any problems. The elaboration of the second instantiation produces the
following message:
Warning: Instance of 'B' is defined in both libraries
'XY' and 'WORK', which are both visible in design 'A'.
The first library will be used. (LINK-14)
This is very confusing. You have to switch the semantics of the VHDL library
concept with each new tool (simulation vs. synthesis). You have no control
over which design is actually read. And it doesn't save on elaboration time
(i.e. it elaborates identical components from the same library over and
over again!!)
- Victor Preis
Siemens AG, R&D
( ESNUG 249 Item 5 ) -------------------------------------------- [8/96]
From: ryan@fsd.com (Ken Ryan)
Subject: HEEEEEELP! -- I've Got CLI or SWIFT or VSS Memory Leaks!
Hey, John!
I'm encountering a mysterious memory leak in my simulation environment.
My testbench in VSS uses RTL VHDL, SmartModels, and the CLI C interface.
I'm running Sun/SunOS4.1.4. There is a memory leak somewhere -- the
simulator process starts out with a size of ~100 megs, but after running
overnight (about 15ms with a 1ns timebase and a 30ns system clock) the
process grows to over 600 megabytes! Nowhere do I use dynamically allocated
variables (VHDL or C) except the instance structure in CLI:
- The structure is allocated in the elaborate routine
by calling cliAllocMem().
- The structure is stored at the end of the elaborate routine
by calling cliSaveInstanceData().
- The structure is restored at the beginning of the evaluate routine by
calling cliRestoreInstanceData().
The evaluate routine is called once per system clock. I'm using the SWIFT
version R40b SmartModels, the models are upd4218160 and tms55161. The VSS
version is v3.4a and v3.4b (tried both). I checked the SOLVIT database and
saw the note about the TEXTIO package. I use it in one spot, but only for
writing lines to the screen and to a disk file. I checked over my own C
code pretty carefully, but I don't see where a leak might be. I'm not
ruling this one out, though.
Does anyone know of any memory problems with VSS or SmartModels or CLI
function calls? I'm out of ideas for where to look.
- Ken Ryan
Orbital Sciences Corp.
( ESNUG 249 Item 6 ) -------------------------------------------- [8/96]
Subject: ( ESNUG 248 #5 ) FSM Treatment Doesn't Seem Consistent Or Coherent
>I am busy with state machine synthesis and I have been fighting with design
>compiler some days in order to get the last picoseconds out of my FSM, that
>is supposed to reach 66 MHz. ... I decided to check Design Compiler using
>an identical FSM coding by swapping the columns of my state coding. I was
>sure to get an identical result, where the synthesized flipflops
>(SIG_st_sm_next_reg[0][1][2][3]) just changed their order.
>
> attribute ENUM_ENCODING of sm_next_state_type : type is
> -- order ABCD
> -- "0011 0010 0001 1010 0000 0100 0101 1101 1100 1011 1111 1001";
> -- order ADBC
> "0101 0001 0100 1001 0000 0010 0110 1110 1010 1101 1111 1100";
>
>I am very unhappy to see that the IDENTICAL synthesis script on an
>IDENTICALLY coded state machine produces DIFFERENT results! ...
>I understand that I get quite different results when using different state
>assignments. Do you know an explanation, why the results also differ, if
>I change only the ORDER of my state vectors?
From: [ Synopsys HDL Advisor Technical Marketing ]
Dear John,
Although I cannot provide an explanation of the state machine compiler's
results, I can sympathize. The starting point for synthesis is *extremely*
important. In fact, the Design Compiler optimization is limited by this
starting point. This means that in many situations it is more efficient to
change the source code. You get a better result and it may even take less
time than iterating with constraints.
Last year we released a new tool, HDL Advisor, which provides metrics
about the starting point for synthesis. In many cases we have adjusted
the source to take into account critical paths (late arriving signals)
and made significant gains in results. Since you are already spending
quite a bit of time trying to meet timing, it may make sense to
see what the tool says about your source.
- [ Synopsys HDL Advisor Technical Marketing ]
---- ---- ---- ---- ---- ---- ---- ----
---- ---- ---- ---- ---- ---- ---- ----
[ Editor's Note: What follows is an exchange of e-mails between Ted
Boydston and Dieter Peer on the synthesizing FSMs problem. - John ]
From: tboydsto@su102s.ess.harris.com (Ted Boydston)
To: peer@iis.fhg.de (Dieter Peer)
Dieter,
If I understood the question, what you did by changing flip flops ABCD to
flip flops ADBC is to change the sequence the machine counts. So, yes,
order does make a difference, because the transfer function that generated
the next-state transitions had to change. That is, a state machine with
two different state vector series must have two different next-state
transfer functions to generate the output, hence different timing parameters.
A simple example would help here:
Let's say you are making a 16-state state machine. You could use a 4 bit
binary counter as your state bits or a 4 bit linear feedback shift register
(LFSR) as your state bits. Since the counter generates a binary sequence,
while the LFSR generates a pseudorandom sequence, the binary counter is
inherently much slower than the LFSR. So, to obtain maximum performance in
this simple case, you would choose the LFSR.
By the way, your encoding method is probably not the best encoding if speed
is a concern. My recommendation, if area is not a concern, is to use one-hot
(or one-cold) encoding in Synopsys's FSM compiler -- this will give you the
highest performance. An example of one-hot in an 8-state state machine:
00000001
00000010
00000100
00001000
00010000
00100000
01000000
10000000
As you can see, each state is assigned its own flip flop, allowing for
minimal decode delay. If area is a concern, then Gray codes are the most
efficient. An example of a Gray code in an 8-state state machine would be:
001
011
111
110
100
000
100
110
In this case, only one bit is changing at any one time, allowing for minimal
decode delay.
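For reference, the standard reflected-binary Gray code can be generated
with one XOR per code word (a Python sketch of mine, not quoted from the
exchange); adjacent codes, including the wrap-around, differ in exactly
one bit, which is the property that keeps next-state decode shallow:

```python
def gray_code(n):
    """Standard reflected-binary Gray code for n bits.

    Code word i is i XOR (i >> 1); adjacent codes (including the
    wrap-around from last back to first) differ in exactly one bit.
    """
    return [i ^ (i >> 1) for i in range(2 ** n)]
```

For 3 bits this yields 000, 001, 011, 010, 110, 111, 101, 100 -- a cycle
visiting every state exactly once with a single bit change per step.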
- Theodore L. Boydston IV
Harris Corporation
---- ---- ---- ---- ---- ---- ---- ----
To: tboydsto@su102s.ess.harris.com (Ted Boydston)
From: peer@iis.fhg.de (Dieter Peer)
Ted,
Your answer is correct, but my question was different. (Sorry I am not a
native English speaker.)
Using one-hot for my 12-state state machine does *not* give the best (max
speed, any area) results. The gate library (ES2 Gate Array) provides at
most 4-input gates (and 6-input complex gates). DC synthesizes too many
logic gates at the outputs of the (here: 12) flipflops, so the time through
the logic gates within the FSM limits the speed. My best results for speed
were achieved using 4- or 5-bit state assignments.
Let's assume your 8-state assignment example using one-hot, naming the
flipflops A,B,C,D,E,F,G,H. If you swap the assignment by changing some of
the columns to, let's say, C,D,E,F,G,H,A,B, you get "another" one-hot. But
wouldn't you expect to get identical circuitry synthesized?
ABCDEFGH CDEFGHAB
00000001 00000100
00000010 00001000
00000100 00010000
00001000 00100000
00010000 01000000
00100000 10000000
01000000 00000001
10000000 00000010
I would expect to get a (functionally) identical netlist, where only the
*names* of the flipflops are swapped. (A->C, B->D, .... F->H, G->A, H->B)
If you do not agree with me, do you have some advice on how to optimize the
order of 1s in a one-hot assignment (or the Gray code assignment) for the
fastest possible circuit? The same would apply to your Gray-code
assignment (ABC swapped to ACB):
ABC ACB
001 010
011 011
111 111
110 101
100 100
000 000
100 100
110 101
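Dieter's column swap is just a permutation of bit positions; a small Python
sketch (mine, with hypothetical helper names, not from the exchange) shows
that permuting the columns of a one-hot assignment yields the same set of
codes, i.e. nothing more than a relabeling of which flop is hot:

```python
def swap_columns(codes, perm):
    """Apply a column permutation to equal-width binary code strings.

    perm[i] gives the source column for output column i, so
    perm = [2, 3, 4, 5, 6, 7, 0, 1] turns order ABCDEFGH into CDEFGHAB.
    """
    return ["".join(code[p] for p in perm) for code in codes]

one_hot = ["00000001", "00000010", "00000100", "00001000",
           "00010000", "00100000", "01000000", "10000000"]

# ABCDEFGH -> CDEFGHAB, matching the table above
swapped = swap_columns(one_hot, [2, 3, 4, 5, 6, 7, 0, 1])
```

The swapped assignment is still one-hot and uses the same set of codes, so
one would expect an identical netlist up to flip-flop names.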
- Dieter Peer
Fraunhofer Gesellschaft, Germany
---- ---- ---- ---- ---- ---- ---- ----
From: tboydsto@su102s.ess.harris.com (Ted Boydston)
To: peer@iis.fhg.de (Dieter Peer)
Dieter,
I now see what you are saying, and for your specific column swap example on
the one-hot case I would agree, because you are not changing the actual
transition pattern (explained later). For the Gray code case, though, I
believe there is a problem. In this case, you performed the following
column swap:
ABC ACB Transition
000 000 1
001 010 2
011 011 3
111 111 4
110 101 5
100 100 6
110 101 7
010 001 8
If you notice, there are very few similar transitions between case ABC and
case ACB (like 3->4), while non-similar transitions dominate (1->2 and 4->5).
Since the machine "counts" a different pattern, I believe that different
control logic will be synthesized to produce the desired pattern, resulting
in a slight performance difference.
Another example would be a straight binary counter with a swap:
ABC ACB Transition
000 000 1
001 010 2
010 001 3
011 011 4
100 100 5
101 110 6
110 101 7
111 111 8
In this case, you can see that case ABC would synthesize to an adder and
control logic, while case ACB would synthesize to some random-logic state
machine, resulting in a significant performance difference.
Your one hot case of:
ABCDEFGH Transition CDEFGHAB Transition
00000001 1 <- reset 00000100 3 <- reset
00000010 2 00001000 4
00000100 3 00010000 5
00001000 4 00100000 6
00010000 5 01000000 7 Transition order
00100000 6 10000000 8 does not change,
01000000 7 00000001 1 its only shifted.
10000000 8 00000010 2
is a special case because the transitions are exactly the same -- just
shifted -- and I would hope to see the same logic: a shift register. :^)
Now, let's say you changed the order of the machine from ABCDEFGH to
HACFGEBD; the machine would no longer transition in the order of a shift
register, resulting in a different circuit. By the way, I think if you
only shift the order of the state bits as a set, without swapping them, you
should achieve the same results with minimal performance impact in any
circuit.
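The shift-register structure Ted expects for an in-order one-hot FSM can be
sketched as a ring counter whose next-state function simply rotates the hot
bit (a minimal Python sketch of mine, under that assumption):

```python
def one_hot_next(state, width=8):
    """Next state of a one-hot ring counter: rotate the hot bit left.

    This is the shift-register next-state function one would expect
    synthesis to find when one-hot states are visited in order.
    """
    return ((state << 1) | (state >> (width - 1))) & ((1 << width) - 1)
```

Eight applications starting from 00000001 cycle through all eight one-hot
states and return to the start; no combinational decode wider than a wire
is needed between flops.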
As for your question about increasing performance, my first post still
holds for one-hot and Gray codes. For one-hot, your Mealy or Moore output
should be the state, and since there is one flip flop per state, you should
not need wide combinatorial decodes to get valid state transitions. Take
your circuit into the FSM Compiler and perform a one-hot synthesis -- that
should give you the fastest output. You will need to modify your HDL a bit
to get it into FSM Compiler, but overall you should not have
many problems. I hope that you can find a solution!
- Theodore L. Boydston IV
Harris Corporation
( ESNUG 249 Item 7 ) -------------------------------------------- [8/96]
From: wendyl@brooktree.com (wendy liu)
Subject: What's The Best Way To Synthesize Multipliers In Synopsys?
Hi John,
Could you share your multiplier design experience using RTL Verilog code and
Synopsys? I know "X=A*B;" doesn't work well. I also know that there are
people who hand instantiate Booth decoders and adders. What is the most
efficient and flexible way to synthesize a multiplier?
- Wendy Liu
Brooktree
( ESNUG 249 Item 8 ) -------------------------------------------- [8/96]
From: dmullen@net.com (David Mullenex)
Subject: Correlating A Mapped Xilinx Design to HDL Advisor 3.4b GTECH
Hi John,
Thanks for running the ESNUG, providing a forum for discussing and sharing
experiences with the $ynop$y$ tool set.
As part of my Xilinx XC4000 FPGA design process, I was using HDL Advisor to
get a pre-mapped prediction of the results my source code would produce. I
used info provided by the HDL Advisor to go back and refine my source code.
Now that I have synthesized my design I would like to take the mapped design
back to HDL Advisor and try correlating the pre-mapped predictions with the
mapped results. Obviously, the Xilinx FPGA architecture and designware make
it hard to correlate back to the Advisor GTECH (Boolean Generic Technology).
I was hoping to use this approach with two objectives. The first is to
simplify and somewhat automate the correlation between my mapped design and
my source code. The second is to get a feel for the relationship between
HDL Advisor GTECH predictions (using only two-input logic functions) and
the Xilinx FPGA mapped design.
Has anyone tried this approach successfully? Is this function supported by
HDL Advisor? (Its GUI seems to indicate that it can be done.)
- David Mullenex
Network Equipment Technologies