( ESNUG 569 Item 2 ) -------------------------------------------- [03/29/17]
From: "Dan Joyce" <user=danj domain=correctdesigns not calm>
Subject: Dan Joyce's 29 cost-effective gate-level simulation tips (pt 1)
Hi, John,
Testing your chip's RTL with formal/lint/ABV/emulation tools -- but without
doing any Gate-Level Simulations (GLS) and then expecting all those
subsequent steps to be perfect is just rolling the dice.
Clint Eastwood: "What you have to ask yourself is..."
"Do you feel lucky?"
"Well... do you? Punk?"
Here's how I regularly catch the 16 types of chip-killer bugs that only GLS
detects, which I described earlier in ESNUG 569 #1. Please be sure to read
that link to understand what I'm saying here.
THE GLS COST VS. BENEFIT RATIO:
Engineering is always about money. Yes, technically some of those 16 bug
types listed in ESNUG 569 #1 can be caught by other methods, but they'll be
very expensive to catch with those other methods. In order to make GLS
cost effective, your verification team must have a GLS plan that:
- IS effective at finding ALL the bugs that matter before tape-out,
but
- is NOT expensive in terms of staff, tools, and compute resources.
Using these 29 tips lets you assign a single engineer to do GLS late in
your design process -- when your first gate-level netlists are coming out
of RTL synthesis -- so that he can cost-effectively catch the chip-killer
design bugs before your netlist is released to tapeout.
Many of these steps are common sense, but some of them are lessons that I've
very painfully learned over the years. And you'd be amazed at how often
engineering teams skip some or many of them.
---- ---- ---- ---- ---- ---- ----
1. Pick A Cost-Effective GLS Regression Suite
Picking the right tests is probably the most important single tip.
They need to be as short as possible, and cover all high risk areas.
This means targeting the types of bugs GLS exposes. This includes
bugs related to reset, basic communication with sub-blocks (liveness),
and clocking issues. This also means analyzing the cost vs. risk of
certain key blocks, and the goals and time constraints of the chip.
- Do A Full Reset Initialization in GLS
Perhaps the most important test for GLS is reset-initialization of
your chip. RTL tests often run with a BFM stubbing out the processor
to make testing easier. There are usually "forces" to skip long
initialization sequences that are not needed for every test. But
those BFMs and forces can be masking a critical bug that keeps the
chip from getting through boot sequence.
Once booted, software workarounds are possible, so it is critically
important to get hardware in the lab through the boot sequence. Most
chips implement multiple backup boot options.
Each of these should be run in GLS. This may require C code or
firmware development, which is hard to create and hard to debug.
But that code typically gets used again in the lab by the chip
validation team. At least one good reset initialization test is
needed with no internal BFMs or forces.
- You Must Test ALL Blocks For Liveness in GLS
This isn't rocket science; it's making sure you do a quick all
systems check throughout your entire chip.
All major blocks of the chip should be brought out of reset, then
initialized, and then run through basic functional tests in GLS.
All state machines should be cycled through a basic operation,
including those located in PHY's and integrated IP.
Write/Read access of DDR, SRAMs, and on-chip registers from each
initiator in the chip is also important.
All SRAMs and FIFOs should be exercised at least partially.
There may be some tough choices here -- especially with IP.
If external IP has a high confidence of working, the argument
could be made to skip GLS for those blocks. However, even IP with
high confidence has been found to have been configured or
synthesized incorrectly and caused late bugs that are only found
with GLS.
- Clocks, Clock Modes, Operate at Max frequency in GLS
Run basic Write/Read and Data movement tests across the chip
in all key clocking modes.
- Low Frequency Test Mode in GLS
Create a test mode that runs all tests with slow clocks. Use it to
bring up the SDF back-annotation flow for Timing GLS before the
netlist is setup clean.
- Use Short-But-Efficient GLS Tests
Leverage tests from RTL but shorten them -- reduce loop counts.
Coverage should be targeted at toggle -- all bits in a data path
must toggle -- especially for components likely to have timing
issues like PHY's and SRAMs. But GLS is not good for exhaustive
coverage. Corner case bugs are unlikely to exist only in gate
level simulation. Looking for them with GLS is not advised.
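As a rough illustration of the toggle goal above, here is a minimal Python
sketch that reports which bits of a data path never took both values across
the data a test drives. It simplifies "toggle" to "seen at both 0 and 1".
The function names, the bus width, and the sample values are all my own
assumptions for illustration; they don't come from any real testbench.

```python
# Sketch only: checks that every bit of a data path was seen at both 0 and 1.
# Names, width, and sample data are illustrative assumptions.

def untoggled_bits(samples, width):
    """Return the bit positions that never took both values 0 and 1."""
    full = (1 << width) - 1
    seen_one = 0
    seen_zero = 0
    for value in samples:
        value &= full                 # ignore anything above the bus width
        seen_one |= value             # bits observed high at least once
        seen_zero |= ~value & full    # bits observed low at least once
    toggled = seen_one & seen_zero
    return [bit for bit in range(width) if not (toggled >> bit) & 1]


def walking_ones(width):
    """All-zeros, then a single 1 walked across the bus."""
    return [0] + [1 << bit for bit in range(width)]


if __name__ == "__main__":
    print(untoggled_bits(walking_ones(8), 8))   # short pattern, full toggle
    print(untoggled_bits([0x00, 0x0F], 8))      # upper nibble never goes high
```

A short walking-ones pattern hits every bit with only width+1 transfers,
which is the kind of shortened-loop test this tip is asking for.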
---- ---- ---- ---- ---- ---- ----
2. Simulate Using Three Models (RTL, Gate, SDF)
Cost-effective GLS requires a debug methodology which makes it
fast and easy to isolate failures to the actual gate or D-FF.
Since tests are often leveraged from other teams, the GLS engineer is
usually debugging tests written by someone else, often in logic the
GLS engineer is not familiar with. Debugging RTL tests in this way is
difficult, but in gates it can be impossible.
Your simulation environments should be setup to allow each test to run
on 3 different models for the DUT (Device Under Test). You use the
RTL model to develop the verification infrastructure for creating and
getting familiar with waves, monitors, checkers, breakpoints and log
messages.
- RTL DUT: use for test generation, and initial passing waves
Has fastest compile and run -- especially when dumping waves.
- GATE DUT with 0-delay: used to optimize simulator performance
with this simulation model.
Has a slower compile and run -- especially when dumping waves.
- GATE DUT with SDF
Slowest compile by far. Simulation speed a little worse than 0-delay
Gates. Many teams are skipping SDF GLS. It is hard and many teams
don't finish before tapeout anyway. Follow these 4 tips below
to do SDF with reasonable effort and in time for tape-out:
A) Create a Hold-Fix script and slow-clocks tests for initial SDF
annotation process development before netlist is hold or setup
clean. This allows an earlier start to SDF Timing simulation.
B) Run SDF with Timing Checks Enabled. Timing checks in SDF cause
the output of the DFF to go to X if the Din changes too close
to the clock edge. This X propagates through the chip and X's
are easier to trace in waves than trying to compare two wave files
for differences. The timing error gets printed to the log file.
C) Use Limited Simulation Environments. While internal BFMs are
useful with GLS tests, internal BFMs are often difficult to get
to work with SDF timing. BFMs using SystemVerilog interfaces
with clocking blocks make BFMs in SDF easier, but still hard.
D) Compile the SDF file to speed-up compilation turn time.
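To make tip B concrete, here is a simplified Python model of what a
setup/hold timing check does to a flop's captured value. This is a sketch
under my own assumptions: a single data change per clock edge, an open
violation window, and integer time units. Real simulators follow the exact
boundary rules of their timing-check implementation, which this does not
try to reproduce.

```python
# Simplified model of a D-FF with setup/hold checking (illustrative only).
# All times are integers in the same unit, e.g. picoseconds.

def dff_sample(d_before, d_after, d_change_time, clk_edge_time, setup, hold):
    """Return the value the flop captures at clk_edge_time.

    Din changes from d_before to d_after at d_change_time.
    A change inside the open window (edge - setup, edge + hold)
    is treated as a timing violation and the output becomes "X".
    """
    if clk_edge_time - setup < d_change_time < clk_edge_time + hold:
        return "X"
    if d_change_time <= clk_edge_time - setup:
        return d_after      # data settled in time, new value captured
    return d_before         # data changed after the hold window


if __name__ == "__main__":
    for t in (900, 980, 1010, 1050):
        print(t, dff_sample(0, 1, t, clk_edge_time=1000, setup=50, hold=20))
```

The point is that a violation shows up as an X at one specific flop, which
is far easier to trace back in waves than a silent wrong value.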
OVERALL: Never run a test on a more complex DUT until it is passing on
the simpler, easier to debug model. Tests should be developed and
debugged on the easiest environment to run and debug - the RTL DUT
environment before moving those tests to simulating on a slower, more
complicated to debug gate netlist. Debug initial GLS issues (libraries
for compile, X's, performance etc.) on the smallest simplest zero delay
or unit delay GLS netlist before moving on to SDF Timing simulations.
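One way a run-script could enforce this rule is sketched below in Python.
The model names and the shape of the results dictionary are hypothetical;
the only thing taken from the text is the ordering RTL, then 0-delay gates,
then SDF gates.

```python
# Sketch: only promote a test to a more complex DUT model once it passes
# on every simpler one. Model names are illustrative assumptions.

MODEL_ORDER = ["rtl", "gate_0delay", "gate_sdf"]


def next_model(results):
    """Return the simplest model the test has not yet passed on.

    results maps model name -> True (passed) or False (failed);
    a missing entry means the test has not been run on that model.
    Returns None once the test has passed on all three models.
    """
    for model in MODEL_ORDER:
        if not results.get(model, False):
            return model
    return None


if __name__ == "__main__":
    print(next_model({}))                                    # start on RTL
    print(next_model({"rtl": True}))                         # move to gates
    print(next_model({"rtl": True, "gate_0delay": False}))   # stay on gates
```

Note that a stale SDF pass does not count if the same test is currently
failing on 0-delay gates; the sketch sends you back to the simpler model.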
---- ---- ---- ---- ---- ---- ----
3. Simulation Performance - Compile Times for GLS
Turn time is the amount of time needed to recompile a model and rerun
a test. This drastically affects the calendar time needed to work
through GLS failures.
Gate model compiles are much longer than RTL compiles.
In a large chip if the compile time for the testbench and RTL DUT is
about 15 minutes at the full chip level, for 0-delay Gate netlist it
will likely be closer to 1 hour, and for gate netlists with SDF timing
it will typically run 4 - 6 hours. That's 4x slower for gate and 16x to
24x slower with SDF. You must put the work in early to reduce the turn
time of your 0-delay and SDF Gate level models. At the tail end of your
project, whether you finish before tapeout is likely determined
by your turn time. If this time is too long, it will be necessary to
create additional smaller block-level GLS environments.
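The arithmetic behind those ratios, as a small Python sketch. The compile
times are the example figures quoted above. The "compiles per day" number
is my own added simplification: it assumes back-to-back recompiles over 24
hours and ignores test runtime entirely, so treat it as an upper bound.

```python
# Turn-time arithmetic using the example compile times from the text.

RTL_MIN = 15                      # testbench + RTL DUT, full chip
GATE_0DELAY_MIN = 60              # 0-delay gate netlist
SDF_MIN_RANGE = (4 * 60, 6 * 60)  # gate netlist with SDF timing


def slowdown(minutes, baseline_minutes=RTL_MIN):
    """Compile-time ratio relative to the RTL compile."""
    return minutes / baseline_minutes


def compiles_per_day(minutes):
    """Upper bound on recompiles in 24 hours (assumes zero test runtime)."""
    return (24 * 60) // minutes


if __name__ == "__main__":
    print(slowdown(GATE_0DELAY_MIN))                # 0-delay gates vs. RTL
    print([slowdown(m) for m in SDF_MIN_RANGE])     # SDF gates vs. RTL
    print(compiles_per_day(RTL_MIN), compiles_per_day(SDF_MIN_RANGE[1]))
```

Even under that generous assumption, a 6 hour SDF compile leaves only a
handful of debug iterations per day.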
- Use Sub-Unit GLS Where Needed
Chips with DDR are always a GLS challenge due to the complexity of
simulating their PHY in Gates, and due to the fine grained timing
checkers typically found in the DDR Verification IP. A sub-unit GLS
environment for a single DDR controller and PHY allows a much faster
turn time and allows the ability to work through all the GLS issues
faster and easier. The sub-unit GLS env can then be leveraged into
the chip level GLS.
- Do Your SDF Compile Once and Ahead of Time
Compiling your SDF annotation file reduces your overall DUT compile
with SDF by around 25%. This is done once per-netlist-SDF-drop and
pays for itself quickly.
- Stub Off Blocks in GLS With Long Initialization Times
DDR initialization can be extremely long in any simulation. Stubbing
the DDR unit with a DDR BFM allows all tests except those focused on
DDR to skip this simulation time consuming step.
- Use Common Sense To Partition Your Compiles
Since most changes requiring a recompile are in the testbench, and
the majority of the compile time is spent in the DUT, it can be
extremely helpful to create an environment that is split into two
partitions -- testbench and DUT -- assuming this feature is supported
by your Verilog/VHDL simulator. CAVEAT EMPTOR: all the simulator
vendors will tell you they can partition compiles like this, but the
truth is some do it better than others; and it must be stable to use.
- Not All Verilog/VHDL Vendors Have The Memory Capacity
Simulators running Gate netlists have made massive improvements in
handling very large designs and compiling with reasonable performance;
but they need massive amounts of RAM to do so -- especially for GLS
SDF compiles. CAVEAT EMPTOR: you must measure the RAM requirements of
your simulator yourself given your gate-level netlist, and make sure
you have the ability to issue gatesim compiles to machines that won't
have other jobs competing for that RAM. Also make sure your jobs
won't stomp on other jobs. Not all simulators have the capacities
or abilities that their salesmen claim they have.
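Once you have measured the RAM each compile needs, a simple placement
check can keep jobs from competing for memory. The Python sketch below uses
a first-fit-decreasing heuristic. The job names, host names, and GB figures
are made up for illustration, and a real compute farm would do this inside
its job scheduler rather than in a standalone script.

```python
# Sketch: place compile jobs on hosts without oversubscribing RAM.
# All names and sizes are illustrative assumptions, not measurements.

def place_jobs(jobs_gb, hosts_gb):
    """First-fit-decreasing placement by measured peak RAM.

    jobs_gb:  {job name: measured peak RAM in GB}
    hosts_gb: {host name: RAM available to these jobs in GB}
    Returns (placement, unplaced) where placement maps job -> host.
    """
    free = dict(hosts_gb)
    placement = {}
    unplaced = []
    # Largest jobs first, so the big SDF compiles get the big machines.
    for job, need in sorted(jobs_gb.items(), key=lambda kv: kv[1], reverse=True):
        host = next((h for h, avail in free.items() if avail >= need), None)
        if host is None:
            unplaced.append(job)      # must wait; never oversubscribe RAM
        else:
            free[host] -= need
            placement[job] = host
    return placement, unplaced


if __name__ == "__main__":
    jobs = {"sdf_compile": 400, "gate_compile": 150, "rtl_compile": 30}
    hosts = {"big01": 512, "std01": 128}
    print(place_jobs(jobs, hosts))
```

In this made-up example the 0-delay gate compile is held back rather than
squeezed onto a machine where it would start page swapping.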
---- ---- ---- ---- ---- ---- ----
4. Simulation Performance - Runtimes for GLS
Getting good enough simulation speed is probably the biggest reason why
GLS is such a challenge.
Long test GLS runs cause progress to move extremely slowly. Tests need
to be short enough to run typically in a few hours, so that when dumping
waves most tests can complete overnight. There will always be *some*
really long tests for a full reset initialization without shortcuts, or
full chip high activity tests that cannot be shortened.
- GLS Performance Test the DUT, Testbench and Servers
The first step in a GLS plan is to identify simulation speed for the
gate level model in GLS. The VCS/Questa/IES Verilog simulators vary
widely in their ability to handle large designs. GLS pushes the size
limits of the simulator.
Historically simulation speed has hit a wall at a certain size, but
recent tool advances have brought simulators that are more scalable on
servers with a lot of RAM. If the simulation's RAM footprint is below
the limit of the simulator, then it seems that switching activity is
the key to GLS performance. GLS tests that cause a great deal of
activity tend to run the longest.
One trick a verification team will do is run a few very big servers
with lots of CPUs and memory for long running GLS tests.
Another trick is running on the local disk to speed-up GLS when it's
dumping waves because it takes the high bandwidth off the server
interconnect. This helps your GLS test, but can also help keep
your GLS tests from affecting other simulations running on the
compute farm.
- Regression Schedulers, Identify Limiters, & "RAM or IO?"
Make sure your regression scheduler and run-script are designed to
handle GLS test runs. The memory needs must be identified for these
large simulations, and the jobs scheduler must make sure the machines
running these jobs will provide your simulation with the needed RAM
and IO bandwidth -- or else page swapping and IO bottlenecks will
destroy the performance for both your GLS tests and for other tests
running on those machines.
- Create and Tag a Few Very Short Key GLS Tests
It's a milestone to run your first GLS on your chip. Other tests
should build on the starting point established by this first test.
Each new netlist should be run first on a few very short tests that
get the DUT out of reset and do some basic liveness testing.
- Stub-Out the Large and/or High Activity Blocks
Use internal BFMs to stub-out large or high activity blocks from
your compile. The best candidates for this are usually things like
DDR, PCIe, and large replicated logic.
- Hunt for the GLS-friendly Simulator Switches
Each simulation tool (VCS from Synopsys, IES from Cadence, Questa
from Mentor) has many proprietary simulator switches. Make it a
point to discuss with the SNPS/CDNS/MENT AE's what the optimal
performance switches are for both zero/unit delay GLS and for SDF
timing GLS for the rev of simulator you're using. CAVEAT EMPTOR: Usually
zero delay is faster than unit delay, but more work may be needed
to solve any delta-delay race condition issues that only occur in
zero delay -- or in badly implemented unit delay.
- Surgical Dumping - Time/Hierarchical Slicing without Recompile
Many failing tests will require a wave file to debug. Dumping the
entire hierarchy from time 0 slows down your simulation way more in
gates than in RTL. Dump only what is needed for the debug.
Also since gate compile times are much longer than RTL, you want your
wave dumping to be flexible such that dumping can be changed without
having to recompile the entire testbench and design.
Instead, create a configurable way to dump various levels and regions
of your design without recompile. Create the ability to turn on the
dumping at a different time without recompile. WARNING: Be aware
that dumping can cause certain GLS simulator optimizations to be
turned off. This has two effects:
1) Performance - slower run with dumping
2) Different functionality - occasionally simulators make mistakes,
and these tend to be from performance optimizations.
To keep consistency, it's much better to run with the same simulator
optimization while either dumping or not dumping. If you find a
simulator bug, try changing optimizations. Also let the vendor know
because they may have a fix for it.
IMPORTANT: Run dumping with your library cells turned off. Dumping
library cells in GLS slows down simulations by as much as 2X.
Struct dumping is OK, but definitely make sure Multi-Dimensional-
Array dumping is off (these are your massive RAMs).
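One common way to get dump control without recompiling is to pass it in at
run time. The Python sketch below builds a list of plusargs for a
run-script to append to the simulator command line. The plusarg names
(+dump_start_ns, +dump_depth, +dump_scopeN) are hypothetical; they only work
if your testbench is written to read them, for example with $value$plusargs.
Turning off library-cell and multi-dimensional-array dumping is
simulator-specific and is not covered here.

```python
# Sketch: build run-time dump controls so no recompile is needed.
# The plusarg names are hypothetical; the testbench must read them itself.

def dump_plusargs(scopes, start_ns=0, depth=1):
    """Return plusargs selecting which hierarchy to dump and from when.

    scopes:   hierarchical paths to dump; an empty list means no dumping
    start_ns: simulation time at which dumping should turn on
    depth:    levels of hierarchy to dump below each scope
    """
    if start_ns < 0 or depth < 0:
        raise ValueError("start_ns and depth must be non-negative")
    if not scopes:
        return []                     # run with dumping off
    args = [f"+dump_start_ns={start_ns}", f"+dump_depth={depth}"]
    for index, scope in enumerate(scopes):
        args.append(f"+dump_scope{index}={scope}")
    return args


if __name__ == "__main__":
    print(dump_plusargs(["tb.dut.u_ddr"], start_ns=5000, depth=2))
```

Rerunning a failing test with a narrower scope or a later start time then
costs only the simulation run, not another gate compile.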
---- ---- ---- ---- ---- ---- ----
5. The Tradeoffs with Sub-Unit Gatesim Testing
Many teams do GLS at the unit level for *all* blocks of the DUT. It's
a cost-benefit decision. While forcing each sub-unit owner to do GLS
costs in time and engineering effort, the upside is it allows people
more familiar with the logic and its tests to debug hard-to-find GLS
bugs in a much smaller and faster environment. The downside is it
results in many more GLS testbenches because the chip level GLS
testing is still required -- at least for the reset initialization.
- Sub-Unit GLS for Your Few High Risk Blocks
To save on engineering cost, some teams only do sub-unit GLS testing
on those few key blocks which they feel are risky or might have
potential performance issues. These are blocks which are notoriously
risky for having design escapes in large SoCs. Unknown/iffy IPs,
PHYs, DDR, and high speed serial logic are at the top of the list.
Since this is where trouble is expected, it makes sense for your team
to put in the extra man-hours to make sure these problem blocks are
given a GLS environment with a fast turn time and high simulation
speed in order to work through all their expected issues efficiently.
---- ---- ---- ---- ---- ---- ----
6. Functional Testing, GLS, and internal BFMs
Environments that employ internal BFMs use them to hijack an internal DUT
interface at the boundary of a block to simplify generation of test
stimulus. The assumption is that the block will work exactly the
same as the BFM in silicon. Leveraging existing tests for use in GLS
is critical to making GLS effort cost-effective. Having to rewrite new
tests to run without a BFM is extremely expensive. Adding BFMs into
your GLS testbenches is tricky however, especially for SDF GLS, but it
is possible. SystemVerilog adds interfaces with clocking blocks which
allow set-up and hold application for entire groups of signals very
easily, and make BFMs in GLS quite doable.
- Leverage Existing Tests and Easier Test Generation
Grabbing an existing set of tests is much easier than generating
a test suite from scratch. Even if existing tests do not exist,
generating stimulus from a BFM is much simpler to create, maintain
and debug. WARNING: But you're trusting your BFM. Neither GLS nor
RTL simulation will catch it if your BFM has subtle errors in it.
- Shorter Tests
Tests using BFMs are shorter because they can skip the initialization
of the logic they replace, and the model simulates faster with a BFM
in place of your design block in gates. WARNING: Using BFMs bypasses
reset initialization testing of those blocks on your chip -- and that
is often where chip-killer bugs lurk.
BFMs make tests shorter and faster to create, but they do it by mimicking
blocks where timing & functional bugs could be hiding.
---- ---- ---- ---- ---- ---- ----
Related Articles
Dan Joyce's 16 bug types only found with gate-level simulation
Dan Joyce's 29 cost-effective gate-level simulation tips (pt 1)
Dan Joyce's 29 cost-effective gate-level simulation tips (pt 2)
Dan Joyce's 29 cost-effective gate-level simulation tips (pt 3)