( ESNUG 569 Item 2 ) -------------------------------------------- [03/29/17]

From: "Dan Joyce" <user=danj domain=correctdesigns not calm>
Subject: Dan Joyce's 29 cost-effective gate-level simulation tips (pt 1)

Hi, John,

Testing your chip's RTL with formal/lint/ABV/emulation tools -- but without
doing any Gate-Level Simulation (GLS) -- and then expecting all those
subsequent steps to be perfect is just rolling the dice.
       Clint Eastwood: "What you have to ask yourself is..."
                       "Do you feel lucky?"
                       "Well... do you?  Punk?"

Here's how I regularly catch the 16 types of chip-killer bugs that I
described earlier in ESNUG 569 #1 -- the bugs that only GLS detects.  Please
be sure to read that link to understand what I'm saying here.


THE GLS COST VS. BENEFIT RATIO:

Engineering is always about money.  Yes, technically some of those 16 bug
types listed in ESNUG 569 #1 can be caught by other methods, but they'll be
very expensive to catch with those other methods.  In order to make GLS
cost-effective, your verification team must have a GLS plan that:

  - IS effective at finding ALL the bugs that matter before tape-out,

but

  - is NOT expensive in terms of staff, tools, and compute resources.

Using these 29 tips lets you assign a single engineer to do GLS late in
your design process -- when your first gate-level netlists are coming out
of RTL synthesis -- so that he can cost-effectively catch the chip-killer
design bugs before your netlist is released to tapeout.

Many of these steps are common sense, but some of them are lessons that I've
very painfully learned over the years.  And you'd be amazed at how often
engineering teams skip some or many of them.

        ----    ----    ----    ----    ----    ----    ----

 1. Pick A Cost-Effective GLS Regression Suite
    Picking the right tests is probably the most important single tip.
    They need to be as short as possible, yet cover all high-risk areas.
    That means tests targeted at the types of bugs GLS exposes: bugs
    related to reset, basic communication with sub-blocks (liveness),
    and clocking issues.  It also means weighing the cost vs. risk of
    certain key blocks against the goals and time constraints of the chip.

    - Do A Full Reset Initialization in GLS
      Perhaps the most important test for GLS is reset-initialization of
      your chip.  RTL tests often run with a BFM stubbing out the processor
      to make testing easier.  There are usually "forces" to skip long
      initialization sequences that are not needed for every test.  But
      those BFMs and forces can be masking a critical bug that keeps the
      chip from getting through its boot sequence.

      Once booted, software workarounds are possible, so it is critically
      important to get hardware in the lab through the boot sequence.  Most
      chips implement multiple backup boot options.

      Each of these should be run in GLS.  This may require C code or
      firmware development, which is hard to create and hard to debug.
      But that code typically gets used again in the lab by the chip
      validation team.  At least one good reset initialization test is
      needed with no internal BFMs or forces.
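
      As a concrete illustration, here is a minimal SystemVerilog sketch of
      a no-shortcut reset test.  The drive_por()/bus_read() tasks and the
      BOOT_STATUS_ADDR register are hypothetical names, not from any real
      environment -- substitute whatever your own testbench provides.

        task automatic run_reset_test;
          logic [31:0] status;
          drive_por(1'b1);                       // assert power-on reset
          #1us;
          drive_por(1'b0);                       // release it -- no forces, no BFMs
          do begin                               // poll the real boot-done bit
            #10us;
            bus_read(BOOT_STATUS_ADDR, status);  // hypothetical status register
          end while (status[0] !== 1'b1);
          $display("%t: boot sequence completed through the gate netlist", $time);
        endtask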

    - You Must Test ALL Blocks For Liveness in GLS
      This isn't rocket science; it's making sure you do a quick all
      systems check throughout your entire chip.

      All major blocks of the chip should be brought out of reset, then
      initialized, and then run through basic functional tests in GLS.

      All state machines should be cycled through a basic operation,
      including those located in PHYs and integrated IP.

      Write/Read access of DDR, SRAMs, and on-chip registers from each
      initiator in the chip is also important.

      All SRAMs and FIFOs should be exercised at least partially.
      There may be some tough choices here -- especially with IP.

      If external IP has a high confidence of working, an argument can be
      made to skip GLS for those blocks.  However, even high-confidence IP
      has turned out to be configured or synthesized incorrectly, causing
      late bugs that were only found with GLS.
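
      For the Write/Read liveness checks, a minimal sketch looks like the
      task below; bus_write()/bus_read() are hypothetical testbench tasks
      (one pair per initiator), not part of any standard library.

        task automatic check_liveness (input logic [31:0] addr);
          logic [31:0] rdata;
          bus_write(addr, 32'hA5A5_5A5A);        // write a known pattern
          bus_read (addr, rdata);                // read it back through the gates
          if (rdata !== 32'hA5A5_5A5A)
            $error("Liveness check failed at 0x%08h: read 0x%08h", addr, rdata);
        endtask

      Calling this once per major block and per initiator is usually enough
      to prove basic liveness without long run times.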

    - Clocks, Clock Modes, Operate at Max frequency in GLS
      Run basic Write/Read and Data movement tests across the chip
      in all key clocking modes.

    - Low Frequency Test Mode in GLS
      Create a test mode that runs every test with slow clocks, so you can
      bring up the SDF back-annotation flow for Timing GLS before the
      netlist is setup clean.
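
      One low-cost way to do this is a plusarg-controlled clock generator,
      sketched below; +SLOW_CLOCKS and the 10 ns / 100 ns periods are
      made-up placeholders.

        module tb_clk_gen (output logic clk);
          timeunit 1ns;  timeprecision 1ps;
          real half_period = 5.0;                // default: 10 ns functional clock
          initial begin
            if ($test$plusargs("SLOW_CLOCKS"))
              half_period = 50.0;                // slow-clock mode for early SDF bring-up
            clk = 1'b0;
            forever #(half_period) clk = ~clk;
          end
        endmodule

      Every test can then be re-run in slow-clock mode without any test or
      testbench changes.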

    - Use Short-But-Efficient GLS Tests
      Leverage tests from RTL but shorten them -- reduce loop counts.
      Coverage should be targeted at toggle -- every bit in a data path
      must toggle -- especially for components likely to have timing
      issues like PHYs and SRAMs.  But GLS is not good for exhaustive
      coverage.  Corner-case bugs are unlikely to exist only in gate-level
      simulation, so looking for them with GLS is not advised.

        ----    ----    ----    ----    ----    ----    ----

 2. Simulate Using Three Models (RTL, Gate, SDF)
    Cost-effective GLS requires a debug methodology which makes it fast
    and easy to isolate failures down to the actual gate or D-FF.

    Since tests are often leveraged from other teams, the GLS engineer is
    usually debugging tests written by someone else, often in logic the
    GLS engineer is not familiar with.  Debugging such tests in RTL is
    difficult; in gates it can be impossible.

    Your simulation environments should be set up to allow each test to
    run on 3 different models of the DUT (Device Under Test).  You use the
    RTL model to develop the verification infrastructure and to get
    familiar with waves, monitors, checkers, breakpoints, and log
    messages.

    - RTL DUT: use for test generation, and initial passing waves
      Has fastest compile and run -- especially when dumping waves.

    - GATE DUT with 0-delay: use to work out library/compile issues, X's,
      and simulator performance before moving on to SDF.
      Has a slower compile and run -- especially when dumping waves.

    - GATE DUT with SDF
      Slowest compile by far.  Simulation speed is a little worse than
      0-delay gates.  Many teams skip SDF GLS -- it is hard, and many don't
      finish it before tapeout anyway.  Follow the 4 tips below to do SDF
      with reasonable effort and in time for tape-out:

      A) Create a Hold-Fix script and slow-clocks tests for initial SDF
         annotation process development before netlist is hold or setup
         clean. This allows an earlier start to SDF Timing simulation.
      B) Run SDF with Timing Checks Enabled.  Timing checks in SDF cause
         the output of a DFF to go to X if its D input changes too close
         to the clock edge (see the sketch after this list).  This X
         propagates through the chip, and X's are easier to trace in waves
         than trying to compare two wave files for differences.  The timing
         error also gets printed to the log file.
      C) Use Limited Simulation Environments.  While internal BFMs are
         useful in GLS tests, they are often difficult to get working with
         SDF timing.  SystemVerilog interfaces with clocking blocks make
         BFMs in SDF easier -- but still hard.

      D) Compile the SDF file to speed-up compilation turn time.
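
      To make item B concrete, here is a minimal behavioral sketch of how a
      library flop turns a timing-check violation into an X.  The cell name,
      ports, and the 0.10/0.05 limits are placeholders, not from any real
      library (real cells usually do this through a UDP with a notifier
      pin, but the effect is the same):

        module dff_x_on_violation (output reg q, input wire d, clk);
          reg notifier;                    // toggled by $setuphold on a violation
          always @(posedge clk) q <= d;    // normal capture
          always @(notifier)    q <= 1'bx; // violation detected: output goes X
          specify
            $setuphold(posedge clk, d, 0.10, 0.05, notifier); // limits overridden by SDF
          endspecify
        endmodule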

    OVERALL: Never run a test on a more complex DUT until it is passing on
    the simpler, easier-to-debug model.  Develop and debug tests in the
    easiest environment to run and debug -- the RTL DUT -- before moving
    them to the slower, harder-to-debug gate netlist.  Debug initial GLS
    issues (libraries for compile, X's, performance, etc.) on the smaller,
    simpler zero-delay or unit-delay GLS netlist before moving on to SDF
    Timing simulations.

        ----    ----    ----    ----    ----    ----    ----

 3. Simulation Performance - Compile Times for GLS
    Turn time is the amount of time needed to recompile a model and rerun
    a test.  This drastically affects the calendar time needed to work
    through GLS failures.

    Gate model compiles are much longer than RTL compiles.

    In a large chip, if the compile time for the testbench and RTL DUT is
    about 15 minutes at the full-chip level, for a 0-delay gate netlist it
    will likely be closer to 1 hour, and for gate netlists with SDF timing
    it will typically run 4 - 6 hours.  That's 4x slower for gates and 16x
    to 24x slower with SDF.  You must put the work in early to reduce the
    turn time of your 0-delay and SDF gate-level models.  At the tail end
    of your project, whether you finish before tapeout is largely
    determined by your turn time.  If this time is too long, it will be
    necessary to create additional smaller block-level GLS environments.

    - Use Sub-Unit GLS Where Needed
      Chips with DDR are always a GLS challenge due to the complexity of
      simulating their PHY in gates, and due to the fine-grained timing
      checks typically found in DDR Verification IP.  A sub-unit GLS
      environment for a single DDR controller and PHY gives a much faster
      turn time and lets you work through all the GLS issues faster and
      more easily.  The sub-unit GLS env can then be leveraged into the
      chip-level GLS.

    - Do Your SDF Compile Once and Ahead of Time
      Compiling your SDF annotation file reduces your overall DUT compile
      with SDF by around 25%.  This is done once per-netlist-SDF-drop and
      pays for itself quickly.
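
      The compiled-SDF mechanics themselves are vendor-specific switches,
      so check your simulator's docs.  For reference, the SDF file gets
      attached to the netlist with the standard $sdf_annotate system task;
      in the sketch below the file name, instance name, and log name are
      placeholders.

        module tb_top;
          chip u_chip ();                         // gate-level netlist top (hypothetical)
          initial begin
            $sdf_annotate("chip_top.sdf",         // SDF from P&R / STA
                          u_chip,                 // scope to annotate
                          ,                       // no SDF config file
                          "sdf_annotate.log",     // annotation warnings go here
                          "MAXIMUM");             // max-delay corner
          end
        endmodule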

    - Stub Off Blocks in GLS With Long Initialization Times
      DDR initialization can be extremely long in any simulation.  Stubbing
      out the DDR unit with a DDR BFM lets every test except those focused
      on DDR skip this time-consuming step.
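
      A minimal sketch of one way to do this is an `ifdef-selected wrapper,
      so the stub becomes a compile-time switch; the module and port names
      below are made up for illustration.

        module ddr_subsys_wrap (input  logic clk, rst_n,
                                input  logic req,
                                output logic ack);
        `ifdef GLS_DDR_STUB
          // Behavioral BFM stub: skips the long DDR init, acks immediately.
          always_ff @(posedge clk or negedge rst_n)
            if (!rst_n) ack <= 1'b0;
            else        ack <= req;
        `else
          // Real gate-level DDR controller + PHY netlist.
          ddr_subsys u_ddr (.clk(clk), .rst_n(rst_n), .req(req), .ack(ack));
        `endif
        endmodule

      Tests that need the real DDR path simply compile without defining
      GLS_DDR_STUB (e.g. no +define+GLS_DDR_STUB on the compile line).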

    - Use Common Sense To Partition Your Compiles
      Since most changes requiring a recompile are in the testbench, and
      the majority of the compile time is spent in the DUT, it can be
      extremely helpful to create an environment that is split into two
      partitions -- testbench and DUT -- assuming this feature is supported
      by your Verilog/VHDL simulator.  CAVEAT EMPTOR: all the simulator
      vendors will tell you they can partition compiles like this, but the
      truth is some do it better than others; and it must be stable to use.

    - Not All Verilog/VHDL Vendors Have The Memory Capacity
      Simulators running Gate netlists have made massive improvements in
      handling very large designs and compiling with reasonable performance;
      but they need massive amounts of RAM to do so -- especially for GLS
      SDF compiles.  CAVEAT EMPTOR: you must measure the RAM requirements of
      your simulator yourself given your gate-level netlist, and make sure
      you have the ability to issue gatesim compiles to machines that won't
      have other jobs competing for that RAM.  Also make sure your jobs
      won't stomp on other jobs.  Not all simulators have the capacities
      or abilities that their salesmen claim they have.

        ----    ----    ----    ----    ----    ----    ----

 4. Simulation Performance - Runtimes for GLS
    Getting good enough simulation speed is probably the biggest reason why
    GLS is such a challenge.

    Long GLS test runs cause progress to move extremely slowly.  Tests need
    to be short enough to run in a few hours, so that when dumping waves
    most tests can still complete overnight.  There will always be *some*
    really long tests -- a full reset initialization without shortcuts, or
    full-chip high-activity tests that cannot be shortened.

    - GLS Performance Test the DUT, Testbench and Servers
      The first step in a GLS plan is to identify simulation speed for the
      gate level model in GLS.  The VCS/Questa/IES Verilog simulators vary
      widely in their ability to handle large designs.  GLS pushes the size
      limits of the simulator.

      Historically simulation speed has hit a wall at a certain design
      size, but recent tool advances have brought simulators that are more
      scalable on servers with a lot of RAM.  If the design fits within
      the simulator's RAM limits, then switching activity becomes the key
      to GLS performance.  GLS tests that cause a great deal of activity
      tend to run the longest.

      One trick verification teams use is to reserve a few very big servers
      with lots of CPUs and memory for long-running GLS tests.

      Another trick is dumping waves to local disk to speed up GLS, because
      it takes the high-bandwidth traffic off the server interconnect.
      This helps your GLS test, and it also helps keep your GLS tests from
      affecting other simulations running on the compute farm.

    - Regression Schedulers, Identify Limiters, & "RAM or IO?"
      Make sure your regression scheduler and run-script are designed to
      handle GLS test runs.  The memory needs must be identified for these
      large simulations, and the jobs scheduler must make sure the machines
      running these jobs will provide your simulation with the needed RAM
      and IO bandwidth -- or else page swapping and IO bottlenecks will
      destroy the performance for both your GLS tests and for other tests
      running on those machines.

    - Create and Tag a Few Very Short Key GLS Tests
      It's a milestone to run your first GLS on your chip.  Other tests
      should build on the starting point established by this first test.
      Each new netlist should be run first on a few very short tests that
      get the DUT out of reset and do some basic liveness testing.

    - Stub-Out the Large and/or High Activity Blocks
      Use internal BFMs to stub out large and high-activity blocks from
      your compile.  The best candidates for this are usually things like
      DDR, PCIe, and large replicated logic.

    - Hunt for the GLS-friendly Simulator Switches
      Each simulation tool (VCS from Synopsys, IES from Cadence, Questa
      from Mentor) has many proprietary simulator switches.  Make it a
      point to discuss with the SNPS/CDNS/MENT AEs what the optimal
      performance switches are for both zero/unit-delay GLS and for SDF
      timing GLS on the rev of simulator you're using.  CAVEAT EMPTOR:
      Usually zero delay is faster than unit delay, but more work may be
      needed to solve delta-delay race conditions that only occur in zero
      delay -- or in badly implemented unit delay.

    - Surgical Dumping - Time/Hierarchical Slicing without Recompile
      Many failing tests will require a wave file to debug.  Dumping the
      entire hierarchy from time 0 slows down your simulation way more in
      gates than in RTL.  Dump only what is needed for the debug.

      Also since gate compile times are much longer than RTL, you want your
      wave dumping to be flexible such that dumping can be changed without
      having to recompile the entire testbench and design.

      Instead, create a configurable way to dump various levels and
      regions of your design without a recompile, plus the ability to turn
      dumping on at a different time without a recompile (a minimal sketch
      appears at the end of this tip).  WARNING: Be aware that dumping can
      cause certain GLS simulator optimizations to be turned off.  This
      has two effects:

       1) Performance - slower run with dumping
       2) Different functionality - occasionally simulators make mistakes,
          and these tend to be from performance optimizations.

      To keep consistency, it's much better to run with the same simulator
      optimizations whether you are dumping or not.  If you find a
      simulator bug, try changing optimizations.  Also let the vendor know,
      because they may have a fix for it.

      IMPORTANT: Run dumping with library-cell dumping turned off.  Dumping
      library cells in GLS slows down simulations by as much as 2X.
      Struct dumping is OK, but definitely make sure Multi-Dimensional-
      Array dumping is off (these are your massive RAMs).
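
      A minimal sketch of time-sliced dumping under plusarg control is
      shown below; +DUMP_WAVES / +DUMP_START are made-up plusarg names and
      tb_top.u_chip is a placeholder scope.  Per-scope selection by string
      is done with vendor-specific dump tasks, so it is not shown here.

        module dump_ctrl;
          timeunit 1ns;  timeprecision 1ps;
          longint start_ns = 0;                    // when to start dumping (in ns)
          initial begin
            if ($test$plusargs("DUMP_WAVES")) begin
              void'($value$plusargs("DUMP_START=%d", start_ns));
              #(start_ns);                         // run wave-free until then
              $dumpfile("waves.vcd");
              $dumpvars(2, tb_top.u_chip);         // 2 levels below the DUT top
            end
          end
        endmodule

      Because the plusargs are read at run time, changing when dumping
      starts never forces a recompile of the netlist.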

        ----    ----    ----    ----    ----    ----    ----

 5. The Tradeoffs with Sub-Unit Gatesim Testing
    Many teams do GLS at the unit level for *all* blocks of the DUT.  It's
    a cost-benefit decision.  While forcing each sub-unit owner to do GLS
    costs time and engineering effort, the upside is that it lets people
    more familiar with the logic and its tests debug hard-to-find GLS bugs
    in a much smaller and faster environment.  The downside is that it
    results in many more GLS testbenches, because chip-level GLS testing
    is still required -- at least for the reset initialization.

    - Sub-Unit GLS for Your Few High Risk Blocks
      To save on engineering cost, some teams only do sub-unit GLS testing
      on those few key blocks which they feel are risky or might have
      potential performance issues.  These are blocks which are notoriously
      risky for having design escapes in large SoCs.  Unknown/iffy IPs,
      PHYs, DDR, and high speed serial logic are at the top of the list.

      Since this is where trouble is expected, it makes sense for your team
      to put in the extra man-hours to make sure these problem blocks are
      given a GLS environment with a fast turn time and high simulation
      speed in order to work through all their expected issues efficiently.

        ----    ----    ----    ----    ----    ----    ----

 6. Functional Testing, GLS, and internal BFMs
    Environments that employ internal BFMs hijack an internal DUT
    interface at the boundary of a block to simplify the generation of test
    stimulus.  The assumption is that the real block will behave exactly
    like the BFM in silicon.  Leveraging existing tests for use in GLS is
    critical to making the GLS effort cost-effective; having to write new
    tests that run without a BFM is extremely expensive.  Adding BFMs into
    your GLS testbenches is tricky, however -- especially for SDF GLS --
    but it is possible.  SystemVerilog interfaces with clocking blocks
    apply setup and hold timing to entire groups of signals very easily,
    and make BFMs in GLS quite doable.
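
    As a minimal sketch, an interface like the one below lets a BFM drive
    and sample with explicit skews against the annotated gate delays; the
    signal names and the 1 ns / 2 ns skews are made-up placeholders.

      interface simple_bus_if (input logic clk);
        timeunit 1ns;  timeprecision 1ps;
        logic        valid;
        logic [31:0] addr, wdata;
        logic        ready;

        // Sample inputs 1 ns before the clock edge, drive outputs 2 ns
        // after it, so the BFM respects setup/hold in SDF simulation.
        clocking drv_cb @(posedge clk);
          default input #1ns output #2ns;
          output valid, addr, wdata;
          input  ready;
        endclocking

        modport bfm (clocking drv_cb);
      endinterface

    A BFM task then drives through the clocking block (e.g.
    "@(bus.drv_cb);  bus.drv_cb.valid <= 1'b1;"), and the output skew is
    applied to the whole signal group automatically.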

    - Leverage Existing Tests and Easier Test Generation
      Grabbing an existing set of tests is much easier than generating a
      test suite from scratch.  Even when suitable tests don't already
      exist, stimulus generated from a BFM is much simpler to create,
      maintain, and debug.  WARNING: But you're trusting your BFM.  Neither
      GLS nor RTL simulation will catch it if your BFM has subtle errors
      in it.

    - Shorter Tests
      Tests using BFMs are shorter because they can skip the initialization
      of the logic they replace, and the model simulates faster with a BFM
      in place of your design block in gates.  WARNING: Using BFMs bypasses
      reset-initialization testing of those blocks on your chip -- which is
      often exactly where chip-killer bugs lurk.

    BFMs make tests quicker to run and faster to create -- but they do it
    by mimicking the very blocks where timing and functional bugs could be
    hiding.

        ----    ----    ----    ----    ----    ----    ----

Related Articles

    Dan Joyce's 16 bug types only found with gate-level simulation
    Dan Joyce's 29 cost-effective gate-level simulation tips (pt 1)
    Dan Joyce's 29 cost-effective gate-level simulation tips (pt 2)
    Dan Joyce's 29 cost-effective gate-level simulation tips (pt 3)
