( ESNUG 587 Item 5 ) ---------------------------------------------- [07/22/19]
EDITOR'S NOTE: I love love love deep technical "how to" and especially
"how NOT to" user posts like this! Now that's good DeepChip! - John
Subject: 12 good and 4 bad switches in new Genus/Innovus/Tempus 19.1 flow
Then we compared power, performance, area (PPA), and runtime across
5 flows using the TSMC CLN7FF library:
Tools Flow Name Comment
---------------------------------- --------------- ------------------
DC-> test ins -> ICC2 -> PT SNPS-All our old SNPS flow
SNPS Fusion Compiler -> PT SNPS-New new SNPS flow
DC-> test ins -> Innovus -> PT Innovus-PT old Innovus flow
DC-> test ins -> Innovus -> Tempus Innovus-Tempus old Innovus+Tempus
Genus-> Modus -> Innovus -> Tempus CDNS-All CDNS only flow
Our goal was to have "Mongo" with 3.0 M inst, a number of ARM cores and
hard macros, and some very tight power requirements reach 3.2 GHz (or
better) in TSMC CLN7FF.
    Flows             Best Freq     TNS left     Total      TAT
                      Achieved      on table     Power
    --------------    ---------     --------     --------   ---------
    SNPS-All          2.87 GHz       97 nsec     1,838 mW   14.7 days
    SNPS-New          2.67 GHz      165 nsec     1,923 mW   12.4 days
    Innovus-PT        3.06 GHz       44 nsec     1,720 mW   11.7 days
    Innovus-Tempus    3.12 GHz       24 nsec     1,667 mW    9.8 days
    CDNS-All          3.22 GHz        0 nsec     1,586 mW    8.2 days
What we found is the "CDNS-All" flow consistently gave us better PPA in
the shortest runtime of any of the flows.
- from Benchmark of DC-ICC2 vs Fusion Compiler vs Genus-Innovus
From: [Ralph, an adult Mutant Ninja Turtle]
Hi, John,
We saw similar trends when we went from DC/ICC2 over to Genus/Innovus 18.1.
Right now we've just switched over to Genus/Innovus/Tempus 19.1, and here's
how we made it work. First, here are the four new parts of the CDNS 19.1
flow in a nutshell.
- new physical restructuring with iSpatial
- new Mux and Datapath Restructuring
- new Machine Learning based optimization
- new Tempus ECO that skips post-route optimization
Cadence has this idea of common engines. They used to have common placement
and common routing engines across all the tools. With CDNS 19.1, we get
"iSpatial", which is what Cadence marketing calls putting GigaOpt everywhere
in their flow so there's one common optimization engine everywhere, too.
(GigaOpt is the optimization engine originally in Innovus. Now it's in
Genus and Tempus, too.)
With iSpatial we see a ~1.8x runtime speed-up for our full 19.1 flow vs. the
old 18.1 full flow. The new iSpatial predicted our area and our power exactly
(because it's GigaOpt moved upstream into Genus), and its timing was better,
too. This better timing was used by our RTL team to tune the RTL. So when
this better 19.1 data is taken to Innovus, Innovus doesn't have to do
placement optimization again. Innovus 19.1 goes purely incremental after
that, which is what gets us that 1.8x runtime speed-up.
To enable iSpatial we needed the following switches:
set_db limited_access_feature {ispatial 214480224}
set_db opt_spatial_effort extreme
syn_opt -spatial
This is what we're using now because it's what the Cadence folks told us to
do. Are other users using different switches to turn on iSpatial?
---- ---- ---- ---- ---- ---- ----
On top of a 1.8x speed up, we also saw better QOR overall with iSpatial from
 - Early Clocking: since we now get GigaOpt within Genus, it now uses
   the early clocking flow in synthesis with useful skew. The clock
   gates and skewed macros now start to show up early in synthesis.
 - Physical Restructuring: the Genus mapper now gets to use GigaOpt to
   make early (and clever) physical optimizations during RTL synthesis.
   Let's say our design has a cascade of adders 6 deep. The old 18.1
   Genus can't see the whole cone of logic it's working on. But with
   19.1, it can see the whole cone and restructure it optimally.
With this we are getting gains in QOR and, more importantly, in congestion.
                      18.1 Genus/Innovus    19.1 iSpatial flow
                      ------------------    ------------------
    Timing WNS/TNS    -88 psec/-265 nsec    -32 psec/-93 nsec
    Power             1,245 mW              1,173 mW
    TTR/TAT           8.12 days             4.78 days
Here are the new switches to enable this restructuring:
set_db limited_access_feature {ispatial_restructuring 439424160}
set_db opt_spatial_restructuring true
---- ---- ---- ---- ---- ---- ----
IMPORTANT: DO NOT RERUN INNOVUS
This new 19.1 restructuring flow above puts Innovus in incremental mode.
But you might lose the TAT benefit if you re-start Innovus again. So the
user has to watch out for that. You have to make sure you set your Innovus
place optimization to the much faster incremental version:
setPlaceMode -place_global_exp_skip_gp true
place_opt_design
Cadence says as long as the Genus dB is handed over to the Innovus backend
dB, we will get this incremental behavior in Innovus. (This works because
the GigaOpt stage is already done and the dB's automatically sense that.)
But it must be a dB handoff! If you DON'T do a dB handoff, then you MUST
use the two switches above.
---- ---- ---- ---- ---- ---- ----
GENUS 19.1 NOW HAS MUX AND DATAPATH RESTRUCTURING IN RTL SYNTHESIS
We are using the 19.1 based Genus release. Our previous 18.1 methodology
used a low-level RTL coding style. This is pretty common and most synthesis
tools handle it well enough. However, we are now doing chips with a lot of
machine learning logic content in them that requires the use of high-level
RTL logic coded in SystemVerilog. For example:
    max = 0;
    for (int i = 0; i < 4; i++) begin
      if (max < array[i]) begin
        max = array[i];
      end
    end
A normal unrolling of the above loop expands to a serial chain of
compare-and-select stages (3 levels for 4 iterations). But the better
solution is to expand it into a balanced comparison tree (2 levels for
4 iterations). For 4 iterations, this saves us 1 level of logic. Now
imagine this with a loop that's 16 iterations deep. It would save 11
logic levels (15 serial levels vs. 4 tree levels). This trade-off
between levels of logic, area, power, and timing is crucial for SV chips.
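The level math above can be sketched in a few lines of Python. (This is our
own back-of-envelope check of the serial-vs-tree depth counts, not anything
from Genus; the function names are made up.)

```python
import math

def serial_levels(n):
    # Unrolled loop: each of the n-1 remaining iterations adds one
    # compare-and-select stage in series.
    return n - 1

def tree_levels(n):
    # Balanced tree: pairwise maxes, so depth is log2(n) rounded up.
    return math.ceil(math.log2(n))

for n in (4, 16):
    saved = serial_levels(n) - tree_levels(n)
    print(f"{n}-way max: serial={serial_levels(n)} levels, "
          f"tree={tree_levels(n)} levels, saved={saved}")
```

For n=4 this gives the 1-level saving, and for n=16 the 11-level saving
quoted above.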
We turned this on in Genus 19.1 with the DP Turbo switch:
set_attr dp_opt_turbo true
We checked that the decisions 19.1 took were correct, and it didn't need
any guidance to get the datapath right.
Timing is important at 5nm, but our key QOR focus is on power. Getting
the datapath and MUX structure just right is important. This DP Turbo
switch is new. (I think Cadence calls this Compus. I don't think they've
announced it yet.) We like the fact that just one switch enables this,
without too many options.
Cadence gave us a way to measure path depths:
    Path Depth Range    # of Paths    # of Paths w/ 19.1
                        w/ 18.1       plus high level opto
    ----------------    ----------    --------------------
     0-4  levels          22,349           22,785
     5-9  levels          34,987           35,955
    10-14 levels          46,876           46,903
    15-19 levels          43,875           44,329
    20-24 levels          74,892           75,309
    25-29 levels          54,985           60,783
    30-34 levels          18,904           24,509
    35-39 levels           7,854            1,023
    40-44 levels           5,689               23
    45-50 levels           1,208                0
Notice with 19.1 the path depth shrinks above 35 levels.
Above 35 levels with 18.1 there are 7,854 + 5,689 + 1,208 = 14,751 paths.
Above 35 levels with 19.1 there are 1,023 + 23 + 0 = 1,046 paths.
What 19.1 is doing is MUX and datapath restructuring to shave roughly
15 levels off the deepest paths.
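To double-check the bin arithmetic from the table, here's a quick Python
tally. (This is just our own sanity check on the reported numbers, nothing
from Cadence.)

```python
# Path-depth histogram from the table above: (range, 18.1 count, 19.1 count)
histogram = [
    ("0-4",   22349, 22785), ("5-9",   34987, 35955),
    ("10-14", 46876, 46903), ("15-19", 43875, 44329),
    ("20-24", 74892, 75309), ("25-29", 54985, 60783),
    ("30-34", 18904, 24509), ("35-39",  7854,  1023),
    ("40-44",  5689,    23), ("45-50",  1208,     0),
]

def paths_above(min_level, col):
    # Sum the path counts for every bin starting at min_level or deeper;
    # col 1 = 18.1 counts, col 2 = 19.1 counts.
    return sum(row[col] for row in histogram
               if int(row[0].split("-")[0]) >= min_level)

print("deep paths w/ 18.1:", paths_above(35, 1))  # 14751
print("deep paths w/ 19.1:", paths_above(35, 2))  # 1046
```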
We were using Genus for our high end CPU QOR. With 19.1 we now have one
tool for both CPU style designs and other designs. (As compared to SNPS
which uses Fusion Compiler for CPU and DC-NXT for all the non-CPU design;
and FC and DC-NXT have different UI's with different unique commands;
whereas CDNS 19.1 has the same commands throughout.)
---- ---- ---- ---- ---- ---- ----
CDNS 19.1 NOW HAS MACHINE LEARNING
We got new code from Cadence where they brag about having Machine Learning.
We found that it fixed their pre-route to post-route correlation problem.
It used to be, when we went from 16nm to 7nm, 18.1's pre-route estimates
came in 30% lower than post-route. With ICC2, the pre-route estimates were
also 30% lower.
                 19.1 without ML     19.1 with ML
                 ----------------    -----------------
    WNS/TNS      -9 psec/-50 psec    -10 psec/-48 psec
    Power QOR    945 mW              911 mW
    TTR/TAT      2.3 days            2.2 days
With CDNS 19.1 trained, we got pre-route to post-route aligned to within
~5%; and that was close enough that it gave us another 3.5% power savings.
What's also good is all our ML training design data resides on our own
network, and we feel secure with that.
We enabled it using the following:
setMachineLearningMode -training net_cell_delay
python <training_package_path>/run.py -mode train \
-type net \
-datadir <data_dir> \
-outdir <model_dir> \
-log train.log
setMachineLearningMode -deployment
As you can see the numbers weren't all that different with or without ML;
but fixing that pre-route to post-route 30% lowball correlation problem is
why we like the new ML in 19.1.
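We don't know what's inside Cadence's model, but the flavor of that
lowball-correlation fix can be sketched as a toy correction-factor fit.
Everything below (the delay numbers, the single-factor model) is our own
made-up illustration, not Cadence's ML:

```python
# Toy sketch: learn one correction factor so pre-route delay estimates
# line up with post-route measurements. Purely illustrative; the real
# ML model is Cadence's and surely far richer than a single scale factor.

# Hypothetical net delays in psec; pre-route lowballs post-route by 30%.
pre_route  = [70.0, 140.0, 35.0, 210.0, 105.0]
post_route = [100.0, 200.0, 50.0, 300.0, 150.0]

# Least-squares fit of a single scale factor k minimizing
# sum((k * pre - post)^2), which gives k = sum(pre*post) / sum(pre^2).
k = (sum(p * q for p, q in zip(pre_route, post_route))
     / sum(p * p for p in pre_route))

corrected = [k * p for p in pre_route]
worst_err = max(abs(c - q) / q for c, q in zip(corrected, post_route))
print(f"learned factor k = {k:.3f}, worst error = {worst_err:.1%}")
```

With a consistent 30% lowball, the fit recovers k = 1/0.7 and the corrected
estimates land right on the post-route numbers.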
---- ---- ---- ---- ---- ---- ----
NEW 19.1 TEMPUS ECO SKIPS POST-ROUTE OPTIMIZATION
Cadence 19.1 is proposing PBA based optimization. (They're saying skip the
post-route optimization step altogether.) Since Tempus is integrated into
Innovus, with Tempus ECO we now get PBA based optimization straight inside
Innovus. Looks like they strengthened their timing driven routing.
With Machine Learning we get such good pre-route to post-route correlation,
we can directly move on to Tempus ECO power optimization as the final step
in our flow. All PBA based. The switch for it is:
signoffOptDesign -setup -hold -leakage -dynamic -drv -area
The above command got us PBA based optimization and replaced the old one:
optDesign -postRoute <-- BAD OLD COMMAND! DO NOT USE!!!
Important: it is a must that you specify the process node setting.
setDesignMode -process 5
It's the one setting to make sure 5nm is enabled properly in 19.1.
---- ---- ---- ---- ---- ---- ----
BROKEN/DANGEROUS STUFF IN 19.1
We tried some other stuff that didn't get us much gain, and we would like
to see if other users are able to make it work.
1.) We tried hold aware scan chain reorder after inserting clocks.
setScanReorderMode -holdAware true
scanReorder
     This didn't work well in reducing hold violations. Cadence needs
     to fix this. It's just not good right now. On one design it messed
     up our scan chains; it improved hold, but the design was unroutable.
     On another design it didn't change hold much.
2.) We have some glitch power left in the design to be optimized. We
     tried the Joules engine for glitch power optimization. It does a
     good job calculating and reporting it, but we need it optimized.
     It literally cannot optimize it! Cadence says an optimizing Joules
     is at least 3 to 6 months away.
3.) We had some functional ECO that might touch the clock network and
we ended up fixing transition violations on those clock nets
ourselves. Cadence told us about this command:
ccopt_pro
     That command was not good. It touched and rebuilt our whole
     clock tree, whereas we just wanted part of the clock network
     optimized. STAY AWAY FROM THIS!
4.) We also tried to do a restricted metal-only ECO limited to 4 layers
     using the following:
     ecoRoute -modifyOnlyLayers 5-8
     It said it changed only those 4 layers, but it also touched the vias
     between layer 4 and layer 5, which cost us an extra layer change.
---- ---- ---- ---- ---- ---- ----
WE HAVE NOT TRIED VIRTUS YET
We are a Tempus/Voltus signoff flow. We are worried about true IR-aware
signoff at 5nm and would like to try out Virtus in this rev of our chip.
We would like to find out from vectorless analysis which are our really
sensitive paths. (Apache isn't truly vectorless yet.) Fingers crossed.
---- ---- ---- ---- ---- ---- ----
CONCLUSION
Our best high level 19.1 recipe (right now) seems to involve
- new physical restructuring with iSpatial
- new Mux and Datapath Restructuring
- new Machine Learning based optimization
- new Tempus ECO that skips post-route optimization
Would like Cadence to fix their hold aware scan chain reorder problem and
to get glitch power optimization working in Joules.
- [Ralph, an adult Mutant Ninja Turtle]
---- ---- ---- ---- ---- ---- ----
Related Articles
User benchmarks DC-ICC2 vs Fusion Compiler vs Genus-Innovus flows
Genus RTL synthesis gaining traction vs. DC is #4 of Best of 2017
After 16nm benchmark, 7nm user swaps out DC-Graphical for Genus-RTL
ICC2 patch rev, Innovus penetration, and the 10nm layout problem
Aart's SUE RIVALS policy backfires horribly on core SNPS patents