( ESNUG 331 Item 1 ) --------------------------------------------- [10/7/99]
From: "Sam Appleton" <sama@groovy.mti.sgi.com>
Subject: One User's Review Of The Synopsys FlexRoute ("Everest TLR") Tool
Dear John,
I know you like user reviews of tools. But, before I go into experiences
using Everest TLR / Synopsys FlexRoute, I've got to explain the physical
landscape of the chip it was used on. Our chip, Krypton, is 3 million
gates, 133 MHz, and it's an in-socket replacement for an older chip. Our
design goal was to add more functionality, 33 percent higher speed and,
therefore, significantly improve our transistor density by at least 20
percent -- hence our focus on the newer, supposedly better P&R tools.
Krypton had 23 major sub-blocks (with 7,000 nets between them), two top
level 256-bit global buses and four global 128-bit buses.
The total "RAM area" was perhaps 10-20% of the chip (although I can't
measure precisely by just eyeballing a plot) with 47 RAM/RF instances.
The bulk of the RF's were Artisan 32x32's and 32x64's.
Krypton had 751,830 placeable instances to place-and-route.
The Five Big Flat Layout "Hells"
--------------------------------
I know it's very common for some companies to do layout as a process on one
big flat design. We considered flat, but these five "hells" came up:
o big flat designs are run-time hell
We had five copies of Cadence Silicon Ensemble (rev 5.1). Placing and
routing a 3 Mgate ASIC would have a huge runtime (guessing 3-4 days or
more), making iterations very difficult and time consuming. Block
logic grouping and global wiring patterns become intractable. Timing
convergence and signal integrity goals become a complete mess.
o big flat designs are extraction hell
Once a final layout GDS/DEF is obtained, the design must be extracted
and passed through static timing to verify timing behavior. Based on
our experience with Avanti Star-RC rev 98.5, extraction of our 3 Mgate
ASIC flat was very difficult and it typically took 1-2 weeks just to
obtain a good SPF. In addition, the 2 gigabyte file size limit imposed by
Solaris 2.5 prohibits the file sizes required for a 3 Mgate SPF, unless
the design is extracted hierarchically.
o big flat designs are back-annotation hell
Once our full flat chip SPF has been generated, it must be accurately
converted to SDF. Even if a full flat SPF can be generated, generating
an SDF from this file was impossible with Avanti Star-DC rev 98.5, so
we used Ultima Millennium rev 1.8 from <http://www.ultimatech.com> .
We liked Ultima. We could extract all the blocks in our chip on an
individual basis and then Ultima would merge all these SPFs into one
big SDF. We could then load that into Primetime and have a ball doing
timing analysis. Good stuff.
o big flat designs are clock tree hell
Inserting a clock into a flat design that is 11 x 11 mm is a bad joke
from almost every perspective. In a big flat design, our signal wires
in initial levels of the tree were very long, making predictability of
delay in the first stages of the clock very hard. Too many crosstalk
problems, too. The ability of the tool we had to use, Cadence's CTgen
rev. 3.3, to analyze and insert a clock with a network of 500K flip-flops
over such an area sucked. Overall, CTgen sucked. To insert clocks, I
had to write my own tool with a few thousand lines of Perl to do the
final clock insertion.
o big flat designs are timing closure hell
John, as your readers know, Design Compiler *estimates* wireloads.
Once our flat design was routed, fixing the small errors due to the
differences between *estimated* wireloads and the real design meant
either IPO netlist iterations or our whole design had to be redone
starting from the floorplan. Either way, our full design must still be
re-extracted & re-annotated -- making a very long timing closure loop.
In practical terms, with engineers here running around tweaking and pumping
netlists out of Design Compiler every day, some way to compartmentalize
their work is a MUST.
So, obviously John, we chose the hierarchical approach.
Choosing Routing Tools
----------------------
As we began developing our CAD flow, we looked at the grid-based routers.
We eliminated Avanti for this review because we've had such spotty support
from them in the past and recently. This left only Cadence WarpRoute and
Compass PathFinder to review. We found:
o no facilities for manual editing and route control
o they are not tuned to block-based designs
o major buses cannot be controlled or planned
o special nets like clocks and noise-sensitive signals cannot be
adequately controlled
o variable width and spacing on nets is very limited
For top-level block-based routers, we evaluated Everest TLR (which is now
Synopsys FlexRoute) and Cadence IC Craftsman. After using IC Craftsman
for a couple of weeks, we very quickly gave up once we saw Everest. IC
Craftsman was a pain. Hard to get data in and out, hard to control buses
(or anything for that matter), and it was slow. Everest had an easier GUI,
it worked faster, and it did all that stuff I listed above plus:
o block pin optimization and control
o controllable bus routing
o top-level floorplanning
o shielding and length balancing
o easy net/bus prerouting
So we chose to use Everest TLR / Synopsys FlexRoute.
Our Experiences With FlexRoute
------------------------------
First off, the ASCII text format of FlexRoute's database made it very easy
to interface to other place-and-route tools, as well as do design revision
control with. No cryptic binaries or proprietary formats to "manage".
Second, FlexRoute's Perl interface also allowed us to easily automate many
repetitive functions as well as add custom functionality that was unique
to our environment. FlexRoute's interface itself provides full access to
the tool's database for manipulation -- a very handy feature for those
wishing to customize the tool for their particular flow.
On to the nitty-gritty...
Since FlexRoute is for hierarchical chip assembly, it fit into our overall
hierarchy-preferred approach quite well. We split our design along sections
of the logical hierarchy, using 23 top-level blocks, each of which was a
place-and-route "unit". Some of these blocks also had sub-units (either
custom logic or other place-and-route units), giving a 3-level hierarchy in
parts of the design. This enabled an incredible acceleration of our design
cycle:
o smaller place-and-route units
Although there were many blocks, the layout generation time for each
was very small, ranging from 1 to 8 hours. Using our five Silicon
Ensemble licenses enabled many blocks to be turned around in parallel
(impossible with a flat approach). Design management for these blocks
was much easier. Clock Insertion (using our homebrew tool) on these
smaller blocks gave excellent skew performance and a relatively shallow
tree. It also allowed better control of crosstalk issues on these
critical signals.
o hierarchical extraction got easy
We set up our extraction tool, Avanti Star-RC, to operate on the top
level layout only to the interface "ports" of each block. This
massively improved top-level extraction time as well as making basic
chip-level LVS faster. Extraction time went from 10 days (because
Star-RC kept crashing) to about 4 hours using it hierarchically.
Our delay calculator, Ultima Millennium, would merge hierarchical
extraction results and give a full-chip SDF for back-annotation.
Changing one part of the design required only that part of the design
be re-extracted, rather than the full-chip (a much faster and much
less difficult task).
o hierarchical verification got easy, too
Each block was also run through Cadence Dracula3 DRC verification
(rev 4.6) in parallel with layout extraction with Avanti Star-RC. (The
project got very CPU hungry then!) We were able to verify each block as
clean, with the top-level layout clean, requiring verification of only
DRC issues pertaining to interactions between top-level and block-level
layouts. This allowed us to tapeout after only one DRC check on the
completed GDS -- it had no errors after layout generation, requiring no
time-intensive iterations for DRC fixes. To verify the final layout,
we used no hierarchy and ran full-chip LVS (Avanti Hercules) to ensure
no interface issues remained from assembly.
o transistor density increased 25%
Our transistor density increased 25% with FlexRoute because of lack of
significant channels and the clean, aligned routing that FlexRoute
performed. Nice. We iterated hundreds of times on the top-level
floorplan with pin changes and block size modifications, for
optimization of both top-level and block-level routing, as well as
overall timing results.
FlexRoute pin optimization was used numerous times to optimize the
top-level routing, resulting in a route with almost no channels and no
"switchboxes" (created when pins are not aligned and routes need to
be hooked up). Pin optimization combined with ultra-fast global
routing (approximately 30 seconds on our 23 block/7000+ net design)
allowed us to very quickly evaluate the chip floorplan for routability
and channel density issues before re-routing affected blocks.
I guesstimate a 10-20% greater density could have been achieved had
we pushed the floorplan harder.
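The parallel turnaround described in the first bullet above (23 blocks, 1 to 8 hours each, five Silicon Ensemble licenses) can be estimated with a simple scheduling sketch. This is only an illustration: the per-block runtimes below are made up, the block names are hypothetical, and the longest-job-first heuristic is an assumption, not a description of how the jobs were actually queued.

```python
import heapq


def schedule_blocks(runtimes_hours, licenses):
    """Greedy longest-first assignment of block jobs to license slots.

    Returns (makespan_hours, assignment), where assignment maps each
    license index to the list of block names it runs, in order.
    """
    # Min-heap of (time at which the license becomes free, license index).
    free_at = [(0.0, i) for i in range(licenses)]
    heapq.heapify(free_at)
    assignment = {i: [] for i in range(licenses)}

    # Longest jobs first, so a big block never starts last on a busy license.
    jobs = sorted(runtimes_hours.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, hours in jobs:
        t, lic = heapq.heappop(free_at)      # earliest-free license
        assignment[lic].append(name)
        heapq.heappush(free_at, (t + hours, lic))

    makespan = max(t for t, _ in free_at)
    return makespan, assignment


# Hypothetical runtimes: 23 blocks cycling through 1..8 hours each.
runtimes = {f"blk{i:02d}": float(1 + (i % 8)) for i in range(23)}
serial = sum(runtimes.values())
makespan, plan = schedule_blocks(runtimes, licenses=5)
print(f"one license: {serial:.0f} h, five licenses: {makespan:.0f} h")
```

The point of the sketch is only that wall-clock time is bounded by roughly the total runtime divided by the license count, plus the longest single block, which a flat run cannot take advantage of.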
Make, Timelines and Tapeout
---------------------------
Using this hierarchical design approach let us iterate the entire chip's
layout in just over a day. Our steps were:
1.) designer creates Verilog
2.) synthesize with Design Compiler
3.) P&R with Silicon Ensemble
4.) layout extraction with Avanti Star-RC
5.) delay calculation using Ultima
6.) load it into PrimeTime
7.) analyze critical paths, etc. back to step 1 or 2 (depending)
Steps 3 through 6 were done automatically with a make file. Didn't have
to touch nor hand hold anything. This took typically between 2 and 16 hours to
do depending on the block size with the largest being 120 Kgates.
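As an illustration of the make-style automation of steps 3 through 6, here is a minimal sketch of the timestamp check that make performs, written in Python. It assumes each block keeps one file per stage in its own directory; the stage names and file suffixes (.v, .def, .spf, .sdf, .rpt) are placeholders chosen for the example, not the actual make file.

```python
import os

# Hypothetical stage table: (stage name, input suffix, output suffix).
STAGES = [
    ("place_route", ".v", ".def"),    # step 3: netlist -> layout
    ("extract", ".def", ".spf"),      # step 4: layout -> parasitics
    ("delay_calc", ".spf", ".sdf"),   # step 5: parasitics -> delays
    ("timing", ".sdf", ".rpt"),       # step 6: delays -> timing report
]


def stale_stages(block_dir, block):
    """Return the stages that must be rerun for one block.

    A stage is stale if its output is missing or older than its input.
    Once one stage is stale, every later stage is stale too.
    """
    stale = []
    dirty = False
    for name, src_ext, dst_ext in STAGES:
        src = os.path.join(block_dir, block + src_ext)
        dst = os.path.join(block_dir, block + dst_ext)
        if not dirty:
            dirty = (not os.path.exists(dst)
                     or os.path.getmtime(dst) < os.path.getmtime(src))
        if dirty:
            stale.append(name)
    return stale


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as d:
        # Only a fresh netlist exists for this hypothetical block.
        open(os.path.join(d, "alu.v"), "w").close()
        print(stale_stages(d, "alu"))
```

Because the check is per block, a new netlist for one block reruns only that block's stages, and the other blocks' results are left alone.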
Going steps 1 through 7, we could turn around the entire chip from new block
netlists to a top-level timing report in around 1.5 days.
For FlexRoute itself, our 3 Mgate ASIC with 23 top-level blocks and over
7,000 nets had routing runtimes of 30 minutes (for preliminary top-level
routes during pre-tapeout timing checks) and 60 minutes (for full top-level
routing with all preroutes and design data for tapeout-quality layout,
including tuned clocks.) By comparison, our old Compass tools used to
take 24 hours just to route a smaller sized design!
This massive improvement allowed us to iterate many more times for optimal
timing and area utilization than we were able to do before.
We iterated on the top-level clock network approximately 50 times for
the optimal skew characteristics. These changes were made in the FlexRoute
database (instead of in a layout database) -- so that new changes to the
netlist for timing and functionality did not disrupt the ongoing effort in
clock tuning the top-level network.
- Sam Appleton
SGI Mountain View, CA